Final Exam, Spring 2024: Topic identification

Version 1.0a-c148515

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of basic Python, pandas, NumPy, clustering, and matrix factorization. It has 9 exercises, numbered 0 to 8. There are 16 available points; however, the threshold to earn 100% is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment "### define demo inputs" load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, even if the starter code does not print them.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head or a chunk of rows at a time).

Exercise points breakdown:

  • Ex. 0 - get_record: 2 point(s)

  • Ex. 1 - clean_record: 1 point(s)

  • Ex. 2 - records_to_metadf: 1 point(s)

  • Ex. 3 - gen_corpus: 2 point(s)

  • Ex. 4 - build_cluster_matrix: 2 point(s)

  • Ex. 5 - get_top_tokens: 2 point(s)

  • Ex. 6 - find_distinctive_tokens: 3 point(s)

  • Ex. 7 - lsa_svd_cluster__FREE: 1 point(s)

  • Ex. 8 - get_nmf_clusters: 2 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

Environment setup

Run the following cells to set up the problem.

In [20]:
%load_ext autoreload
%autoreload 2

import numpy as np
import scipy as sp
import pandas as pd

import dill
from pprint import pprint
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

The problem: Mining a computer-document corpus for "topics"

Your overall task. In this problem, the dataset is the metadata for a collection of computer science research papers and technical books. Your task will be to try to identify topics, or collections of similar documents, using three different algorithms:

  • Algorithm 1: K-means.
  • Algorithm 2: An SVD-based method.
  • Algorithm 3: NMF – nonnegative matrix factorization.

You won't have to implement these algorithms, but you will have to prepare the data and figure out how to run code that implements them.

The dataset. The dataset comes from the AMiner collection. Run the following cell to load the raw data into a variable named raw_records and display a sample.

In [21]:
with open('resource/asnlib/publicdata/acm-v9-sample-100k.dill', 'rb') as fp:
    raw_records = dill.load(fp)

print(f"=== Success: Loaded {len(raw_records):,} raw records. ===")
print(f"\nExample: Records 0, 7, and 15:\n")
pprint([raw_records[k] for k in [0, 7, 15]])
=== Success: Loaded 100,000 raw records. ===

Example: Records 0, 7, and 15:

[['#*Computationally efficient algorithms for a one-time pad scheme\n',
  '#@A. G. Akritas, S. S. Lyengar, A. A. Rampuria\n',
  '#t1984\n',
  '#cInternational Journal of Parallel Programming\n',
  '#index6\n'],
 ['#*Extended person-machine interface\n',
  '#@Rachel Reichman-Adar\n',
  '#t1984\n',
  '#cArtificial Intelligence\n',
  '#index108\n'],
 ['#*Algorithms for on-the-fly garbage collection\n',
  '#@Mordechai Ben-Ari\n',
  '#t1984\n',
  '#cACM Transactions on Programming Languages and Systems (TOPLAS)\n',
  '#index269\n',
  '#%317992\n',
  '#%320265\n',
  '#%320491\n',
  '#%598024\n',
  '#%2135000\n']]

The raw_records data structure is a list of raw records. Each raw record represents the metadata for a technical paper or book, stored as a list of raw (unprocessed) strings. Each string is a raw attribute.

Each raw attribute encodes some meta-information about the paper, like its title, list of authors, year of publication, and so on. The attribute starts with a short prefix beginning with #. The prefix indicates what the attribute is, and it is immediately followed by its string value, with no delimiters separating the prefix from the value.

Here are the possible prefixes and their meaning:

  • #*: Title of the work
  • #@: Author names
  • #t: Year of publication
  • #c: Where the work was published
  • #index: ID number of this work
  • #%: ID number of a work that this work cites
  • #!: Abstract (short description of the work)

Compare these prefixes to the example raw records printed above and observe the following:

  1. Not all prefixes are present for a given raw record.
  2. The #% prefix represents a citation, that is, the ID number of a work that this work cites. A raw record may have more than one #% attribute.

Stopwords. Per common practice, our text analysis will involve ignoring certain words, referred to as stopwords. The code cell below will load a pre-selected set of such words for some of the demos that follow.

In [22]:
with open('resource/asnlib/publicdata/STOPWORDS.dill', 'rb') as fp:
    STOPWORDS = dill.load(fp)

print("A sample of stopwords:")
sorted(list(STOPWORDS))[:7]
A sample of stopwords:
Out[22]:
['a', 'about', 'above', 'acm', 'after', 'again', 'against']
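For reference, here is a minimal sketch (not required for any exercise) of how a stopword set like this is typically applied; the token values below are made up for illustration.

# Made-up tokens, just to illustrate how the stopword set will be used later:
tokens = {'a', 'parallel', 'programming', 'about'}
kept = sorted(tokens.difference(STOPWORDS))  # set difference drops stopwords such as 'a' and 'about'
print(kept)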

Ex. 0 (2pts): get_record

Your task: Define get_record as follows:

Convert a single 'raw' record into a specially formatted Python dictionary.

Input: lines: A list of strings, where each string is a raw attribute.

Return: record: A Python dictionary, constructed as follows.

Requirements/steps: For each raw attribute, convert its prefix into a human-readable key (see below) and extract its value as a string. Recall that a given prefix, like '#%', may appear in more than one raw attribute. Therefore, the output record should associate each key with a sorted list of all values encountered.

The human-readable key corresponding to each raw attribute is:

  • '#*': 'title'
  • '#@': 'authors'
  • '#t': 'year'
  • '#c': 'venue'
  • '#index': 'id'
  • '#%': 'refs'
  • '#!': 'abstract'

Other notes: If a raw attribute appears only once, it should still appear in a list. (The list will be one-element long in this case.)

Example. A correct implementation should produce, for the demo, the following output:

{'authors': ['Mordechai Ben-Ari'],
 'id': ['269'],
 'refs': ['2135000', '317992', '320265', '320491', '598024'],
 'title': ['Algorithms for on-the-fly garbage collection'],
 'venue': ['ACM Transactions on Programming Languages and Systems (TOPLAS)'],
 'year': ['1984']}
In [23]:
### Solution - Exercise 0  

def get_record(lines: list) -> dict:
    
    # INPUT: lines is a list of strings
    
    # GOAL: Return 'record', a dictionary of lists. Convert prefix at beginning of each string in lines to correct key. 
    # List values should be sorted.
    
    # STRATEGY:
    # 1. Initialize record dictionary to hold our results
    # 2. Create a mappings dictionary to map each prefix to its corresponding key
    # 3. Iterate over prefixes
        # 4. Iterate over lines
            # 5. If the raw attribute string in lines starts with the prefix, add that raw attribute to a list, 
            #    getting rid of newline char at the end
        # 6. After iterating over all lines, if the list for that prefix is not empty, sort the list and add it to our record dictionary with the correct key
    # 7. Return record dictionary     
    
    
    # SOLUTION:
    record = {}
    mappings_dict = {'#*': 'title', '#@': 'authors', '#t': 'year', '#c': 'venue', '#index': 'id', '#%': 'refs', '#!': 'abstract'}
    
    for prefix in mappings_dict.keys():
        list_of_attributes = []
        for raw_attribute in lines:
            if raw_attribute.startswith(prefix):
            # Alternative: instead of using .startswith()
            #if prefix == raw_attribute[:len(prefix)]:
                list_of_attributes.append(raw_attribute[len(prefix):].rstrip())
        if len(list_of_attributes) > 0:
            record[mappings_dict[prefix]] = sorted(list_of_attributes)
    
    return record

### Demo function call

pprint(get_record(raw_records[15]))
{'authors': ['Mordechai Ben-Ari'],
 'id': ['269'],
 'refs': ['2135000', '317992', '320265', '320491', '598024'],
 'title': ['Algorithms for on-the-fly garbage collection'],
 'venue': ['ACM Transactions on Programming Languages and Systems (TOPLAS)'],
 'year': ['1984']}


Test cell. The cell below will test your solution for get_record (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [24]:
### Test Cell - Exercise 0  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['get_record']['config']

ex_conf['func'] = get_record

tester = Tester(ex_conf, key=b'4oTuD3jYU_dfQvrxIzSmKJQBDeiWMaODi2nmk0sQk1o=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Run me!

Whether your solution is working or not, run the following code cell, which will preload a list of processed records (dictionaries) in the global variable, records.

In [25]:
with open('resource/asnlib/publicdata/records.dill', 'rb') as fp:
    records = dill.load(fp)

print(f"Sample post-processed records:")
for k in [0, 7, 15]:
    print(f"\n=== get_record(raw_records[{k}]) ===")
    pprint(records[k])
Sample post-processed records:

=== get_record(raw_records[0]) ===
{'authors': ['A. G. Akritas, S. S. Lyengar, A. A. Rampuria'],
 'id': ['6'],
 'title': ['Computationally efficient algorithms for a one-time pad scheme'],
 'venue': ['International Journal of Parallel Programming'],
 'year': ['1984']}

=== get_record(raw_records[7]) ===
{'authors': ['Rachel Reichman-Adar'],
 'id': ['108'],
 'title': ['Extended person-machine interface'],
 'venue': ['Artificial Intelligence'],
 'year': ['1984']}

=== get_record(raw_records[15]) ===
{'authors': ['Mordechai Ben-Ari'],
 'id': ['269'],
 'refs': ['2135000', '317992', '320265', '320491', '598024'],
 'title': ['Algorithms for on-the-fly garbage collection'],
 'venue': ['ACM Transactions on Programming Languages and Systems (TOPLAS)'],
 'year': ['1984']}

Ex. 1 (1pts): clean_record

Your task: Define clean_record as follows:

Clean a (non-raw) dictionary record.

Input: record — A Python dictionary as might be produced by get_record.

Return: cleaned — A copy of record cleaned as explained below, or None if any required keys are missing.

Requirements/steps: The goal of this task is two-fold: 1) Ensure that the record has certain "mandatory" attributes, and 2) simplify the values for certain single-value attributes. More specifically, your function should satisfy these requirements.

  • If the input is missing any of the following keys, return None: ['id', 'title', 'year', 'venue', 'abstract'].
  • For the 'id' and 'year' keys, assume there is only one value in the list and convert it into an integer and store this value.
  • If there is a 'refs' key, convert each element of its list to an integer and store this list of integers.
  • For any other keys that are present in record, assume there is only one value and store that value as-is (i.e., as a string).

Return the cleaned dictionary (or None per the first requirement).

Example. A correct implementation should produce, for the demo, the following output:

None
{'abstract': 'A mathematical model for communicating sequential processes '
             'isgiven, and a number of its interesting and useful properties '
             'arestated and proved. The possibilities of nondetermimsm are '
             'fullytaken into account.',
 'authors': 'S. D. Brookes, C. A. R. Hoare, A. W. Roscoe',
 'id': 319,
 'refs': [2135000, 318212, 320203, 374129, 408527, 547420, 555361, 565837,
          566544, 566549, 680046, 689430],
 'title': 'A Theory of Communicating Sequential Processes',
 'venue': 'Journal of the ACM (JACM)',
 'year': 1984}
In [26]:
### Solution - Exercise 1  

def clean_record(record: dict) -> dict:
    
    # INPUT: record is a dictionary of lists
    
    # GOAL: Return cleaned dictionary. 
    # 1. If the input is missing any of the following keys, return None: ['id', 'title', 'year', 'venue', 'abstract'].
    # 2. For the 'id' and 'year' keys, assume there is only one value in the list and convert it into an integer 
    # 3. If there is a 'refs' key, convert each element of its list to an integer and store this list of integers.
    # 4. For any other keys that are present in record, assume there is only one value and store that value as-is 
    
    # STRATEGY:
    # 1. First check to make sure all required keys exist in 'record'
    # 2. Initialize empty dictionary 'cleaned' to hold our results
    # 3. Iterate over key/value pairs in record dictionary
            # 4. If key is 'id' or 'year', convert 0th entry in list to an integer and store in 'cleaned'
            # 5. If key is 'refs', convert each element of list to integer and store list in 'cleaned'
            # 6. Anything else, store 0th entry in list in 'cleaned'
    # 7. Return 'cleaned' dictionary
        
    # SOLUTION:
    required_keys = ['id', 'title', 'year', 'venue', 'abstract']
    
    for key in required_keys:
        if key not in record:
            return None
    
    cleaned = {}
    
    for key, val in record.items():
        if key == 'id' or key == 'year':
            cleaned[key] = int(val[0])
        elif key == 'refs':
            cleaned[key] = [int(e) for e in val]
        else:
            cleaned[key] = val[0]
    
    return cleaned

### Demo function call

pprint(clean_record(records[0])) # None!
pprint(clean_record(records[17])) # Valid
None
{'abstract': 'A mathematical model for communicating sequential processes '
             'isgiven, and a number of its interesting and useful properties '
             'arestated and proved. The possibilities of nondetermimsm are '
             'fullytaken into account.',
 'authors': 'S. D. Brookes, C. A. R. Hoare, A. W. Roscoe',
 'id': 319,
 'refs': [2135000,
          318212,
          320203,
          374129,
          408527,
          547420,
          555361,
          565837,
          566544,
          566549,
          680046,
          689430],
 'title': 'A Theory of Communicating Sequential Processes',
 'venue': 'Journal of the ACM (JACM)',
 'year': 1984}


Test cell. The cell below will test your solution for clean_record (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [27]:
### Test Cell - Exercise 1  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['clean_record']['config']

ex_conf['func'] = clean_record

tester = Tester(ex_conf, key=b'BhZT_oWZZqOjWxKNFXA5A7uL44XtRE6X7vsh1IcmmOs=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Run me!

Whether your solution is working or not, run the following code cell, which will preload a list of cleaned records in the global variable, cleaned_records.

In [28]:
with open('resource/asnlib/publicdata/papers.dill', 'rb') as fp:
    cleaned_records = dill.load(fp)

print("Examples of cleaned records:")
for k in [0, len(cleaned_records)//2, len(cleaned_records)-1]:
    record = cleaned_records[k]
    print(f"\n=== Record {k} ===")
    pprint(record)
Examples of cleaned records:

=== Record 0 ===
{'abstract': 'A mathematical model for communicating sequential processes '
             'isgiven, and a number of its interesting and useful properties '
             'arestated and proved. The possibilities of nondetermimsm are '
             'fullytaken into account.',
 'authors': 'S. D. Brookes, C. A. R. Hoare, A. W. Roscoe',
 'id': 319,
 'refs': [2135000,
          318212,
          320203,
          374129,
          408527,
          547420,
          555361,
          565837,
          566544,
          566549,
          680046,
          689430],
 'title': 'A Theory of Communicating Sequential Processes',
 'venue': 'Journal of the ACM (JACM)',
 'year': 1984}

=== Record 36141 ===
{'abstract': 'Despite the dawn of the XML era, semantic interoperability '
             'issues still remain unsolved. As various initiatives trying to '
             'address how the underlying business information should be '
             'modelled, named and structured are being realised throughout the '
             'world, the importance of moving towards a holistic approach in '
             'eBusiness magnifies. In this paper, an attempt to clarify '
             'between the standards prevailing in the area is performed and '
             'the XML Data Standards providing generic XML Schemas are '
             "presented. Based on this ''XML Data Standards Map'', a "
             'multi-faceted classification mechanism is proposed, leading to '
             'an extensible taxonomy of standards. A set of facets is analyzed '
             'for each standard, allowing for their classification based on '
             'their scope, completeness, compatibility with other standards, '
             'openness, ability to modify the schemas and maturity, to name a '
             'few. Through populating and querying this multi-faceted '
             'classification, a common understanding of Data Integration '
             'Standards can be ensured and the choice of a standard according '
             'to the requirements of each business can be systematically '
             'addressed.',
 'authors': 'Fenareti Lampathaki, Spiros Mouzakitis, George Gionis, Yannis '
            'Charalabidis, Dimitris Askounis',
 'id': 1249240,
 'refs': [1023578,
          1294868,
          2135000,
          346625,
          347823,
          413758,
          438312,
          448191,
          576618,
          577523,
          629245,
          728956,
          830436,
          932129],
 'title': 'Business to business interoperability: A current review of XML data '
          'integration standards',
 'venue': 'Computer Standards & Interfaces',
 'year': 2009}

=== Record 72281 ===
{'abstract': 'The paper presents a pilot research on the application of '
             'clinical decision support systems in a atrophic gastritis '
             'screening task. Two different DSS learning strategies have been '
             'tested - a standalone classifier and classifier ensemble '
             'application. Such classification algorithms as C4.5, CART, JRip '
             'and Naive Bayes were used as base classifiers. The classifiers '
             'were evaluated on the respondent medical data from an inquiry '
             'form, containing 28 attributes and 840 records. The dataset was '
             'preprocessed using simple methods in initial data analysis as '
             'well as more complex data mining methods for feature selection. '
             'The obtained results are summarized and discussed in order to '
             'summarize an information on what learning strategies are more '
             'applicable to the present dataset and should be studied in more '
             'detail in primary research.',
 'authors': 'Sergei Parshutin, Arnis Kirshners',
 'id': 2002090,
 'refs': [1023499, 1129901, 1223568, 1396386, 1842708, 2135000, 889706],
 'title': 'Research on clinical decision support systems development for '
          'atrophic gastritis screening',
 'venue': 'Expert Systems with Applications: An International Journal',
 'year': 2013}

Ex. 2 (1pts): records_to_metadf

Your task: Define records_to_metadf as follows:

Convert (cleaned) records, which represent publications metadata, into a pandas DataFrame.

Input: records – A list of cleaned records (e.g., see cleaned_records).

Return: metadf – A pandas DataFrame of the metadata, as defined below.

Requirements/steps: Construct a pandas DataFrame with columns for these keys: 'id', 'title', 'year', 'venue', 'abstract'

Other notes: The target fields are guaranteed to exist based on the cleaning procedure.

Example. The demo runs the function on elements [11222, 11239, 12329] of cleaned_records. A correct implementation would produce:

id | title | year | venue | abstract
648046 | A Tutorial for CC++ | 1994 | A Tutorial for CC++ | No abstract available.
648461 | Logics of Programs | 1989 | Logics of Programs | None Available
673426 | Codes and Number Theory | 1997 | Codes and Number Theory | Codes and Number Theory
In [29]:
### Solution - Exercise 2  

def records_to_metadf(records: list) -> pd.DataFrame:
    
    # INPUT: records is a list of cleaned records (dictionaries)
    
    # GOAL: Return metadf data frame with columns: id, title, year, venue, abstract
    
    # STRATEGY: 
    # 1. Create metadf dataframe from 'records' (list of dictionaries)
    # 2. Filter metadf down to the columns we care about ('id', 'title', 'year', 'venue', 'abstract')
    # 3. Return   
    
    # SOLUTION:
    # See section 'Create DataFrame from List of Dictionaries'
    # https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/
    
    metadf = pd.DataFrame(records)
    columns_we_want = ['id', 'title', 'year', 'venue', 'abstract']
    
    return metadf[columns_we_want]

### Demo function call

demo_cleaned_records = [cleaned_records[k] for k in [11222, 11239, 12329]]
demo_metadf = records_to_metadf(demo_cleaned_records)
display(demo_metadf)
id title year venue abstract
0 648046 A Tutorial for CC++ 1994 A Tutorial for CC++ No abstract available.
1 648461 Logics of Programs 1989 Logics of Programs None Available
2 673426 Codes and Number Theory 1997 Codes and Number Theory Codes and Number Theory


Test cell. The cell below will test your solution for records_to_metadf (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [30]:
### Test Cell - Exercise 2  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['records_to_metadf']['config']

ex_conf['func'] = records_to_metadf

tester = Tester(ex_conf, key=b'hgW1Cvvo9Kg4sTImuGlhMic09mEdkjjQQFPMCiw5gaE=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Run me!

Whether your solution is working or not, run the following code cell, which will preload a dataframe of metadata records in the global variable, metadf.

In [31]:
with open('resource/asnlib/publicdata/metadf.dill', 'rb') as fp:
    metadf = dill.load(fp)

# For Ex. 3, break dependence with a working Ex. 2
demo_metadf = metadf.loc[[11222, 11239, 12329]].reset_index(drop=True)

print("Examples of metadata records:")
display(metadf.sample(5))
Examples of metadata records:
id title year venue abstract
51477 1570496 Parallel controller design and synthesis 2010 Proceedings of the 7th FPGAworld Conference Petri nets provide an adequate means to visual...
483 66858 Load balancing and task decomposition techniqu... 1989 Proceedings of the 1989 ACM/IEEE conference on... Integrated vision systems employ a sequence of...
23229 973704 On Rigorous Design and Implementation of Fault... 2007 ISORC '07 Proceedings of the 10th IEEE Interna... In this paper we present our vision of establi...
10895 639513 An asset view on the software process 1996 ISPW '96 Proceedings of the 10th International... The development and reuse of software engineer...
53868 1620595 Improved prediction of biomass composition for... 2012 Expert Systems with Applications: An Internati... Fourier transform near-infrared (FT-NIR) techn...

Ex. 3 (2pts): gen_corpus

Your task: Define gen_corpus as follows:

Construct a "document corpus" that can be fed into some text analysis tools. This corpus consists of "pseudo-documents," built up from the documents' "tokens."

Inputs:

  • metadf – A dataframe with document metadata, as produced by records_to_metadf.
  • stopwords - A Python set containing tokens to ignore.

Return: pseudodf – A new dataframe where tokens have been extracted and filtered from some of the attributes, formatted as explained below.

Requirements/steps: Your dataframe should have the following columns:

  • 'id': The document ID, taken as-is from the input, metadf['id'].
  • 'title': The document title, taken as-is from metadf['title'].
  • 'pseudodoc': A "pseudo-document," which is a string built from the 'title', 'venue', and 'abstract' as follows.

Recall that each row of metadf represents a document. For each document, its pseudo-document is constructed by extracting tokens from lowercase versions of its title, venue, and abstract strings. When extracting tokens, regard a consecutive sequence of plain alphabetic characters, 'a' through 'z', as a token. (Regex may be helpful here!) To construct the pseudo-document from these tokens, first sort the unique tokens, remove any stopwords, and then join the remaining tokens using a single space character, ' ', to separate them.
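To make this concrete, here is a minimal sketch on a made-up string (the tiny stopword set below is hypothetical; in the exercise you must use the stopwords argument):

import re

text = "A Tutorial for CC++ A Tutorial for CC++ No abstract available."
tokens = set(re.findall('[a-z]+', text.lower()))      # unique runs of 'a' through 'z'
kept = sorted(tokens.difference({'a', 'for', 'no'}))  # drop (made-up) stopwords, then sort
print(' '.join(kept))                                 # 'abstract available cc tutorial'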

Other notes: Per the above requirements, do not forget to:

  1. extract your unique tokens from lowercase versions of title, venue, and abstract;
  2. remove any stopwords using the input stopwords (and not the global variable, STOPWORDS, which is for demos); and
  3. concatenate only the unique tokens, in sorted order, and separated by spaces.

Example: Suppose we run gen_corpus on the demo metadata dataframe from the previous exercise. A correct implementation would return:

id | title | pseudodoc
648046 | A Tutorial for CC++ | abstract available cc tutorial
648461 | Logics of Programs | available logics none programs
673426 | Codes and Number Theory | codes number theory
In [32]:
### Solution - Exercise 3  

def gen_corpus(metadf, stopwords):
    
    # INPUT: 
    # 1. metadf - a data frame 
    # 2. stopwords - a set of words we should exclude
    
    # GOAL:
    # Return 'pseudodf' a data frame with columns:
    #     'id': The document ID, taken as-is from the input, metadf['id']
    #     'title': The document title, taken as-is from metadf['title']
    #     'pseudodoc': A "pseudo-document," which is a string built from the 'title', 'venue', and 'abstract' as follows:
                # 1. get unique tokens from lowercase of title, venue, and abstract (consecutive sequence of a-z only)
                # 2. remove stopwords
                # 3. concatenate only the unique tokens, in sorted order, and separated by spaces
    
    # STRATEGY:
    # 1. Make a copy of metadf. Call this 'df'
    # 2. Create an empty list to hold our pseudodoc column data
    # 3. Iterate over rows in df
            # 4. Combine lowercase of columns 'title', 'venue', and 'abstract' (separate by space) into one big string
            # 5. Find list of words containing a-z characters
            # 6. Find unique words in Step 5 list (use set?)
            # 7. Remove all stopwords from Step 6 and sort
            # 8. Join Step 7 list of words back together into one string using single space character
            # 9. Add to 'pseudodoc' list
    # 10. Add 'pseudodoc' list as a column to our df
    # 11. Filter df to columns we care about (id, title, pseudodoc) and return 
    
    # SOLUTION:
    import re
    
    df = metadf.copy()
    pseudodoc = []
    
    for index, row in df.iterrows():
        all_words_str = row['title'].lower() + ' ' + row['venue'].lower() + ' ' + row['abstract'].lower()
        word_list = re.findall('[a-z]+', all_words_str)
        
        # Alternative way if you don't want to use regex:
#         new_words_str = ''
#         for char in all_words_str:
#             if char in 'abcdefghijklmnopqrstuvwxyz':
#                 new_words_str += char
#             else:
#                 new_words_str += ' '
        
#         word_list = new_words_str.split()

        unique_words = set(word_list)
        final_list_of_words = sorted(unique_words.difference(stopwords))
        pseudodoc.append(' '.join(final_list_of_words))
    
    df['pseudodoc'] = pseudodoc
    return df[['id', 'title', 'pseudodoc']]
    


### Demo function call
demo_corpus = gen_corpus(demo_metadf, STOPWORDS)
display(demo_corpus)
id title pseudodoc
0 648046 A Tutorial for CC++ abstract available cc tutorial
1 648461 Logics of Programs available logics none programs
2 673426 Codes and Number Theory codes number theory


Test cell. The cell below will test your solution for gen_corpus (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [33]:
### Test Cell - Exercise 3  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['gen_corpus']['config']

ex_conf['func'] = gen_corpus

tester = Tester(ex_conf, key=b'o0kicBPd2HinEbuciVMQHNjfCLJj4eVAWPBqKCfSeNw=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Run me!

Whether your solution is working or not, run the following code cell, which will preload a corpus of pseudo-documents in the global dataframe, corpusdf.

In [34]:
with open('resource/asnlib/publicdata/corpus.dill', 'rb') as fp:
    corpusdf = dill.load(fp)

print(f"A sample of the pseudo-document corpus:")
display(corpusdf.sample(4))
A sample of the pseudo-document corpus:
id title pseudodoc
11776 660805 LART: Flexible, Low-Power Building Blocks for ... abstract alternatively blocks board boards bui...
40503 1338378 Making sense of meaning: leveraging social pro... activity addressed allows areas artifact aspec...
12613 681189 Strokes from Pen-Opposed Extended Edges across characters correlation digital directio...
67725 1907215 Why virtual desktop at CCRI?: finding sustaina... additionally allows annual anytime anywhere ap...

Mini-tutorial (read & run me!): k-means clustering + matrix factorization

A "classical" way to represent the corpus for data mining is in the bag of words model. In this model, we transform a corpus consisting of n documents and m unique tokens into an n×m (sparse) matrix X, where element xij "measures" the strength of the relationship between document i and token j.

Using one of scikit-learn's text feature extraction tools, we constructed a matrix from the corpus. Run the next code cell to load the matrix into the global variable, X.

In [35]:
with open('resource/asnlib/publicdata/X.dill', 'rb') as fp:
    X = dill.load(fp)
    
print(f"The shape of `X`: {X.shape}")
print(f"- Smallest value: {X.min()}")
print(f"- Largest value: {X.max()}")
print(f"- Mean value: {X.mean()}")
The shape of `X`: (72282, 108145)
- Smallest value: 0.0
- Largest value: 1.0
- Mean value: 6.840847576497508e-05
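For reference, a matrix like the one just loaded could be built from the pseudo-document corpus roughly as follows. This is a sketch that assumes scikit-learn's TfidfVectorizer; the actual tool and settings used to create X may differ.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sketch only: the actual vectorizer and parameters behind `X` may differ.
vectorizer = TfidfVectorizer()
X_sketch = vectorizer.fit_transform(corpusdf['pseudodoc'])  # sparse matrix: documents x tokens
print(X_sketch.shape)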

Here is "spy plot" of X, which visualizes the locations of the nonzero entries.

Spy plot showing the nonzero structure of the sparse matrix, `X`
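(If you want to reproduce a plot like this yourself, a minimal sketch using Matplotlib's spy function might look like the following; this is an assumption about how the figure was made, and it can be slow for a matrix this large.)

import matplotlib.pyplot as plt

# Sketch only: plot the positions of the nonzero entries of the sparse matrix `X`.
plt.figure(figsize=(6, 4))
plt.spy(X, markersize=0.05, aspect='auto')
plt.title('Nonzero structure of X')
plt.show()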

A topic model via k-means. Given X, we can then use scikit-learn's K-means implementation to cluster the documents. It works by (magically) calculating a vector-space representation of the documents of X and then running K-means on it, similar to what you did in Notebook 14. We can interpret each cluster as a "topic" that groups documents containing similar tokens together.

To save a little time, we ran this clustering for you to try to find 10 topics (clusters). (The code we used appears at the end of the exam.)
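For reference only, a clustering like the one we precomputed might be run roughly as follows. This is a sketch that assumes scikit-learn's KMeans; the actual code and parameters, which appear at the end of the exam, may differ.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical sketch of how the precomputed clustering might have been produced.
kmeans = KMeans(n_clusters=10, n_init=1, random_state=0).fit(X)
labels_sketch = kmeans.labels_               # cluster ID for each document (row of X)
sizes_sketch = np.bincount(labels_sketch)    # number of documents per cluster
centers_sketch = kmeans.cluster_centers_.T   # tokens x clusters: each column is a centroid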

Run the cell below to load the precomputed clustering results, which consist of the following:

  • km_sizes: The number of documents in each cluster.
  • km_labels: A 1-D array where km_labels[i] is the cluster ID number (0-9) to which document i is assigned.
  • km_centers: A 2-D array where column km_centers[:, j] is the coordinate of the centroid of cluster j.
In [36]:
with open('resource/asnlib/publicdata/kmeans.dill', 'rb') as fp:
    _, km_sizes, km_centers, km_labels = dill.load(fp)

print(f"- Cluster sizes:", km_sizes)
print(f"- Cluster labels:", km_labels)
print(f"- Cluster centers, {km_centers.shape[0]:,} x {km_centers.shape[1]}, where each column is a centroid:")
pprint(km_centers)
- Cluster sizes: [10136  6764  4193  5724  5270  2855  7475  9599 13564  6702]
- Cluster labels: [7 7 7 ... 6 5 6]
- Cluster centers, 108,145 x 10, where each column is a centroid:
array([[1.60634653e-05, 1.10854239e-04, 0.00000000e+00, ...,
        6.51034007e-05, 2.60202046e-05, 0.00000000e+00],
       [0.00000000e+00, 2.31752030e-04, 1.76940759e-04, ...,
        0.00000000e+00, 5.29350135e-05, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 5.24392544e-05, 1.89033683e-05],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        2.26129576e-05, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.06486921e-05, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
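As an optional sanity check (a small sketch using the variables just loaded), the cluster sizes should be recoverable from the labels, and there should be one label per document:

# Optional sanity checks on the precomputed k-means results (sketch only):
assert len(km_labels) == X.shape[0]                      # one label per document (row of X)
assert np.array_equal(np.bincount(km_labels), km_sizes)  # cluster sizes follow from the labels
print("The precomputed k-means results look consistent.")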

The "matrix factorization" interpretation of k-means. Although we commonly think of k-means as clustering points, we can also view what it does as computing a certain kind of matrix factorization.

Recall that $X$ has $n$ rows (documents) and $m$ columns (tokens), and that we identified $k$ topics using k-means clustering. Let $C$ be the $m \times k$ matrix of centroids that resulted (so each column of $C$ is a centroid, matching km_centers above).

What k-means does is assign each document $i$ to one of these centroids. Let $R$ be an $n \times k$ (sparse) matrix where entry $r_{is} = 1$ if document $i$ is assigned to topic $s$ and equals 0 otherwise. Then one way to interpret k-means is to say that it approximates the matrix $X$ by the product,

$$X \approx R C^T.$$

To see why, let $\hat{r}_i^T$ be the $i$-th row of $R$ and $c_s$ be the $s$-th column of $C$. By the definition of matrix product,

$$\hat{r}_i^T C^T = \hat{r}_i^T \left[ c_0 \,\cdots\, c_s \,\cdots\, c_{k-1} \right]^T = \left[ r_{i0} \,\cdots\, r_{is} \,\cdots\, r_{i,k-1} \right] \begin{bmatrix} c_0^T \\ \vdots \\ c_s^T \\ \vdots \\ c_{k-1}^T \end{bmatrix} = \sum_s r_{is}\, c_s^T.$$

However, since k-means assigned document $i$ to exactly one cluster $s$, only one term in the sum is nonzero, namely $c_s^T$, the (transposed) centroid for cluster $s$. In other words, the action of $R C^T$ is to assign each document (row) of $X$ to one centroid.
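Here is a tiny, self-contained numerical check of this interpretation (toy data only, unrelated to the exam dataset):

import numpy as np

# Toy check that R @ C.T copies each document's assigned centroid into the corresponding row.
labels_toy = np.array([1, 0, 2, 1])                # 4 documents assigned to 3 clusters
C_toy = np.array([[1.0, 0.0, 5.0],                 # 2 tokens x 3 clusters: columns are centroids
                  [2.0, 3.0, 4.0]])
R_toy = np.zeros((4, 3))
R_toy[np.arange(4), labels_toy] = 1.0              # r_is = 1 iff document i is in cluster s
print(R_toy @ C_toy.T)                             # row i equals C_toy[:, labels_toy[i]]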

With this concept in mind, you are now ready for the next exercise.

Ex. 4 (2pts): build_cluster_matrix

Your task: Define build_cluster_matrix as follows:

Given a k-means clustering result, construct a k-means assignment matrix. From the preceding discussion, it is the R matrix.

Inputs:

  • labels: A 1-D array where labels[i] is the cluster ID assigned to document i.
  • max_label: The largest cluster ID plus 1. That is, 0 <= labels[i] < max_label for all i.

Return: R, a Scipy COO matrix (scipy.sparse.coo_matrix).

Requirements/steps: The returned matrix R should have len(labels) rows and max_label columns. It should only have entries equal to 1.0 at positions (i, j) where j == labels[i]; all other entries should be regarded as zero values and, therefore, not stored explicitly.

Example: Suppose we run:

build_cluster_matrix(np.array([1, 2, 0, 0, 2, 2, 2, 3, 0, 3]), 4)

A correct implementation would return:

  (0, 1)    1.0
  (1, 2)    1.0
  (2, 0)    1.0
  (3, 0)    1.0
  (4, 2)    1.0
  (5, 2)    1.0
  (6, 2)    1.0
  (7, 3)    1.0
  (8, 0)    1.0
  (9, 3)    1.0

Here is a picture (spy plot) of the demo `R` matrix:

In [53]:
### Solution - Exercise 4  

def build_cluster_matrix(labels: np.ndarray, max_label: int) -> sp.sparse.coo_matrix:
    
    # INPUT: 
    # 1. labels: A 1-D array where labels[i] is the cluster ID assigned to document i
    # 2. max_label: The largest cluster ID plus 1. That is, 0 <= labels[i] < max_label for all i

    # GOAL: Return Scipy coo matrix, with len(labels) rows and max_label columns. 
    # It should only have entries equal to 1.0 at positions (i, j) where j == labels[i]; 
    # all other entries should be regarded as zero values and, therefore, not stored explicitly.

    # STRATEGY:
    # 1. Create a 2-D array of zeros with len(labels) rows and max_label columns
    # 2. Iterate over labels and set the corresponding i, j locations to 1.0
    # 3. Convert to a coo matrix and return
    
    # SOLUTION:   
    full_arr = np.zeros((len(labels), max_label))
    
    for i, j in enumerate(labels):
        full_arr[i,j] = 1.0
    
    return sp.sparse.coo_matrix(full_arr)

    # -------------------------------------------------------------
    
    # Alternative solution building out data, row, and column arrays (instead of populating 1.0s in giant array):
    
    # Create coo matrix using this syntax: coo_matrix((data, (i, j)), shape=(M, N))
    # Construct from three arrays:
    # 1. data[:] the entries of the matrix, in any order
    # 2. i[:] the row indices of the matrix entries
    # 3. j[:] the column indices of the matrix entries
    
    data = []
    i_list = []
    j_list = []
    
    for i, j in enumerate(labels):
        data.append(1.0)
        i_list.append(i)
        j_list.append(j)
    
    return sp.sparse.coo_matrix((np.array(data), (np.array(i_list), np.array(j_list))), shape=(len(labels), max_label))



### Demo function call
demo_R = build_cluster_matrix(np.array([1, 2, 0, 0, 2, 2, 2, 3, 0, 3]), 4)
print(demo_R)
  (0, 1)	1.0
  (1, 2)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 2)	1.0
  (5, 2)	1.0
  (6, 2)	1.0
  (7, 3)	1.0
  (8, 0)	1.0
  (9, 3)	1.0


Test cell. The cell below will test your solution for build_cluster_matrix (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [54]:
### Test Cell - Exercise 4  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['build_cluster_matrix']['config']

ex_conf['func'] = build_cluster_matrix

tester = Tester(ex_conf, key=b'cub7HYh_bdtYjSXeD5UPThVliqosSf9XWu4LLxZjlOs=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Run me!

Whether your solution is working or not, run the following code cell, which will preload the k-means cluster matrix, R, for the data matrix X, in the global variable R_km.

In [39]:
with open('resource/asnlib/publicdata/R.dill', 'rb') as fp:
    R_km = dill.load(fp)
    
R_km
Out[39]:
<72282x10 sparse matrix of type '<class 'numpy.float64'>'
	with 72282 stored elements in COOrdinate format>
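As an optional, quick consistency check (a sketch assuming R_km was built from the same k-means run loaded earlier), the column sums of R_km should reproduce the cluster sizes, and each row should contain exactly one stored entry:

# Optional sketch: compare the assignment matrix against the earlier k-means results.
col_sums = np.asarray(R_km.sum(axis=0)).ravel()
print(col_sums)                                     # expect this to match km_sizes
print(np.asarray(R_km.sum(axis=1)).ravel().max())   # expect 1.0 (one cluster per document)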

Ex. 5 (2pts): get_top_tokens

Your task: Define get_top_tokens as follows:

Given a cluster, identify its most frequent tokens.

Inputs:

  • cid: The ID of a cluster to analyze.
  • labels: Cluster assignments, where document i was assigned to cluster labels[i].
  • corpusdf: A corpus / pseudo-document dataframe (see Ex. 3, gen_corpus), having the columns 'id', 'title', and 'pseudodoc'.
  • k: The number of tokens to return.

Return: top_tokens, a Python set of the k most frequent tokens.

Requirements/steps:

  1. Use labels to identify the documents belonging to cluster cid.
  2. Select their corresponding pseudo-documents from corpusdf. See the note below.
  3. For each unique token, count the number of pseudo-documents in which it appeared.
  4. Return the k most frequently occurring tokens as a Python set. In the case of ties, consider tokens in ascending order.

Other notes:

  • To match a document ID i to its pseudo-document, note that i is the index value of corpusdf (and, in particular, not the 'id' column!).
  • If there are fewer than k unique tokens, return as many as available.
  • If there are ties, sort the tied tokens by name (i.e., use the token itself, in ascending order, as the tiebreaker).

Example: For the demo code, a correct implementation should return:

{'page', 'reviews', 'academic', 'book'}
In [40]:
### Solution - Exercise 5  

def get_top_tokens(cid: int, labels: np.ndarray, corpusdf: pd.DataFrame, k=10) -> set:
    
    # INPUT: 
    # 1. cid: The ID of a cluster to analyze
    # 2. labels: 1-D array. Cluster assignments, where document i was assigned to cluster labels[i]
    # 3. corpusdf: A pseudo-document dataframe, having the columns 'id', 'title', and 'pseudodoc'
    # 4. k: integer - the number of tokens to return
    
    # GOAL:
    # Return top_tokens, a Python set of the k most frequent tokens
    
    # STRATEGY:
    # 1. Use labels to identify the documents belonging to cluster cid
    # 2. Select their corresponding pseudo-documents corpusdf. To match a document ID i to its pseudo-document, 
    #    note that i is the index value of corpusdf (and, in particular, not the 'id' column!)
    # 3. For each unique token, count the number of pseudo-documents in which it appeared.
    # 4. Return up to the k most frequently occurring tokens as a Python set (so we need to sort).
    #    In the case of ties, consider tokens in ascending order.
    
    # SOLUTION:
    # Create a big list of all of the tokens from documents associated with our desired cluster id cid:
    token_list = []
    
    for i, cluster_id in enumerate(labels):
        if cluster_id == cid:
            pseudodoc = corpusdf['pseudodoc'].loc[i]
            token_list += pseudodoc.split()
    
    # Now count occurrences of all the tokens in token_list:
    from collections import Counter
    
    counter_dict = Counter(token_list)
    
    # Sort by count value in descending order, then break ties by key value in ascending order:
    sorted_list_of_tuples = sorted(counter_dict.items(), key=lambda x: (-x[1], x[0]))
    
    # Grab top k elements and get only the keys (not the values):
    top_k_list_of_tuples = sorted_list_of_tuples[:k]
    top_k_list = [x[0] for x in top_k_list_of_tuples]
    
    # Return as a set:
    return set(top_k_list)

### Demo function call
with open('resource/asnlib/publicdata/demo_args_get_top_tokens.dill', 'rb') as fp:
    demo_cid, demo_labels, demo_corpusdf, demo_k = dill.load(fp)
print(get_top_tokens(demo_cid, demo_labels, demo_corpusdf, demo_k))
{'book', 'page', 'reviews', 'academic'}


Test cell. The cell below will test your solution for get_top_tokens (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [41]:
### Test Cell - Exercise 5  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['get_top_tokens']['config']

ex_conf['func'] = get_top_tokens

tester = Tester(ex_conf, key=b'BBpXHA04_pc-q0tx10aYqwXtRUfiJeKp90vRPTLzC5k=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Run me!

We ran one of our reference solutions for get_top_tokens on corpusdf to extract top tokens for all the clusters discovered by a K-means run. Run the following code cell to load them into a global dictionary called km_top_tokens.

In [42]:
with open('resource/asnlib/publicdata/top_tokens.dill', 'rb') as fp:
    km_top_tokens = dill.load(fp)
    
km_top_tokens
Out[42]:
{0: {'analysis',
  'control',
  'data',
  'model',
  'performance',
  'simulation',
  'system',
  'systems',
  'time',
  'two'},
 1: {'applications',
  'computing',
  'data',
  'distributed',
  'information',
  'network',
  'service',
  'services',
  'system',
  'systems'},
 2: {'mobile',
  'network',
  'networks',
  'nodes',
  'performance',
  'show',
  'simulation',
  'systems',
  'time',
  'wireless'},
 3: {'applications',
  'data',
  'design',
  'high',
  'parallel',
  'performance',
  'show',
  'system',
  'systems',
  'time'},
 4: {'algorithm',
  'analysis',
  'data',
  'image',
  'images',
  'information',
  'processing',
  'recognition',
  'show',
  'two'},
 5: {'algorithm',
  'artificial',
  'data',
  'information',
  'intelligence',
  'learning',
  'model',
  'system',
  'systems',
  'volume'},
 6: {'algorithm',
  'analysis',
  'data',
  'information',
  'methods',
  'model',
  'performance',
  'show',
  'systems',
  'two'},
 7: {'algorithm',
  'algorithms',
  'linear',
  'number',
  'problems',
  'set',
  'show',
  'theory',
  'time',
  'two'},
 8: {'data',
  'design',
  'development',
  'information',
  'research',
  'study',
  'system',
  'systems',
  'technology',
  'work'},
 9: {'applications',
  'design',
  'development',
  'engineering',
  'language',
  'model',
  'programming',
  'software',
  'system',
  'systems'}}

Ex. 6 (3pts): find_distinctive_tokens

Your task: Define find_distinctive_tokens as follows:

Suppose you are given the top tokens of several clusters. For each cluster, determine its "most distinctive" tokens and keep only those.

Inputs:

  • token_sets: A dictionary, where token_sets[i] are the top tokens appearing in cluster i.
  • max_occur: The maximum number of allowable occurrences of a token across clusters for that token to be considered "distinctive" of that cluster.

Return:

  • distinctive: A new dictionary mapping each cluster key to a (possibly smaller) set of tokens, distinctive[key].

Requirements/steps:

  1. For each token, determine how many clusters it belongs to.
  2. For each cluster, consider its distinctive tokens to be only those that appear in at most max_occur clusters. If a cluster has no distinctive tokens, do not include it in the output.
  3. Construct and return a new dictionary mapping each cluster ID to its distinctive tokens.

Example: For the demo code, a correct implementation should return:

{0: {'analysis', 'simulation', 'control'}, 4: {'processing', 'images', 'image', 'analysis', 'recognition'}, 6: {'analysis', 'methods'}, 2: {'network', 'nodes', 'networks', 'simulation', 'wireless', 'mobile'}, 1: {'network', 'distributed', 'service', 'computing', 'services', 'applications'}, 3: {'parallel', 'applications', 'design', 'high'}, 9: {'engineering', 'programming', 'language', 'software', 'applications', 'design', 'development'}, 8: {'work', 'technology', 'study', 'research', 'design', 'development'}, 5: {'artificial', 'intelligence', 'learning', 'volume'}, 7: {'problems', 'algorithms', 'number', 'linear', 'theory', 'set'}}
In [43]:
### Solution - Exercise 6  

def find_distinctive_tokens(token_sets: dict, max_occur=3) -> dict:
    # INPUT: 
    # 1. token_sets is a dictionary of sets, where token_sets[i] are the top tokens appearing in cluster i
    # 2. max_occur is integer, maximum number of allowable occurrences of a token across clusters for that token 
    #    to be considered "distinctive" of that cluster.
    
    # GOAL:
    # Return 'distinctive': A new dictionary mapping cluster key to a set of distinctive tokens
    # So we're filtering out the common tokens, and only keeping the distinctive ones
    
    # STRATEGY:
    # 1. Get a list of all of the tokens in the 'token_sets' dictionary (aka combine all of the sets into one big list)
    # 2. Count up occurrences of each token in token list created in Step 1. Let's create a Counter dictionary.
    # 3. Initialize a 'distinctive' default dictionary of sets to hold our new dictionary results
    # 4. Iterate over key, value pairs in 'token_sets' dictionary:
            # 5. Iterate over tokens in that key's set of values:
                    # 6. If token has less than or equal to max_occur in our Counter dictionary, he's good 
                    #    and we add him to our new 'distinctive' dictionary set for that key
    # 7. Return 'distinctive' dictionary
    
    # SOLUTION:
    from collections import Counter
    from collections import defaultdict
    
    list_of_tokens = [val for subset in token_sets.values() for val in subset]
    token_count_dict = Counter(list_of_tokens)
    
    distinctive = defaultdict(set)
    
    for cluster_id, set_of_tokens in token_sets.items():
        for token in set_of_tokens:
            if token_count_dict[token] <= max_occur:
                distinctive[cluster_id].add(token)
    
    return dict(distinctive)

### Demo function call
print(find_distinctive_tokens(km_top_tokens))
{0: {'control', 'simulation', 'analysis'}, 1: {'applications', 'services', 'network', 'computing', 'service', 'distributed'}, 2: {'network', 'simulation', 'mobile', 'nodes', 'networks', 'wireless'}, 3: {'design', 'parallel', 'high', 'applications'}, 4: {'image', 'recognition', 'processing', 'analysis', 'images'}, 5: {'volume', 'learning', 'artificial', 'intelligence'}, 6: {'methods', 'analysis'}, 7: {'theory', 'set', 'problems', 'algorithms', 'number', 'linear'}, 8: {'design', 'technology', 'development', 'research', 'study', 'work'}, 9: {'language', 'applications', 'programming', 'design', 'software', 'engineering', 'development'}}


Test cell. The cell below will test your solution for find_distinctive_tokens (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [44]:
### Test Cell - Exercise 6  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['find_distinctive_tokens']['config']

ex_conf['func'] = find_distinctive_tokens

tester = Tester(ex_conf, key=b'5VeeLsiYTK9AUwJ8sN5jiO4xU3sWzBJooV29Q6R7Erk=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Ex. 7 (1pts): lsa_svd_cluster__FREE

Your task: Define lsa_svd_cluster__FREE as follows:

FREE EXERCISE: Use the SVD and K-means to compute a clustering. This method is sometimes referred to as latent semantic analysis (LSA).

Background: In the data-matrix factorization view of K-means, we compute $X \approx R C^T$ and use $R$ to identify clusters. That suggests we could try other factorization methods, including the singular value decomposition (SVD) from Notebook 15.

In this strategy, we compute a $d$-truncated SVD, $X \approx U_d \Sigma_d V_d^T$, where we choose $d$ to be "small" compared to the number of columns of $X$. In so doing, the matrix $Y \equiv U_d \Sigma_d$ acts as a lower-dimensional representation of $X$, where each row of $Y$ is a $d$-dimensional version of the corresponding row of $X$. We can then run K-means on $Y$.

Here, we are giving you some code to carry out this analysis. This code uses scikit-learn's TruncatedSVD and K-means implementations. Inspect the results and then move on to the next (and last) part and exercise.

Inputs:

  • X: An n-by-m data matrix where rows are data points.
  • num_clusters: The desired number of clusters to identify.
  • num_components: The truncated dimension, d.

Return: labels, an array of cluster labels. The i-th row of X is "assigned" to a cluster whose ID is labels[i]. These labels will lie in the range 0 to num_clusters-1.

In [45]:
### Solution - Exercise 7  

def lsa_svd_cluster__FREE(X: np.ndarray, num_clusters=10, num_components=10):
    """**FREE EXERCISE**: Use the SVD and K-means to compute a clustering.
This method is sometimes referred to as _latent semantic analysis_ (LSA).

**Background**: In the data-matrix factorization view of K-means, we compute $X \sim R C^T$ and use $R$ to identify clusters.
That suggests we could try _other_ factorization methods, including the singular value decomposition (SVD) from Notebook 15.

In this strategy, we compute a $d$-truncated SVD, $X \\approx U_d \Sigma_d V_d^T$, where we choose $d$ to be "small" compared to the number of columns of $X$.
In so doing, the matrix $Y \equiv U_d \Sigma_d$ acts as a lower-dimensional representation of $X$, where each row of $Y$ is a $d$-dimensional version of the corresponding row of $X$.
We can then run K-means on $Y$.

Here, we are giving you some code to carry out this analysis.
This code uses scikit-learn's [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#) and [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#) implementations.
Inspect the results and then move on to the next (and last) part and exercise.

**Inputs**:
- `X`: An `n`-by-`m` data matrix where rows are data points.
- `num_clusters`: The desired number of clusters to identify.
- `num_components`: The truncated dimension, $d$.

**Return:** `labels`, an array of cluster labels.
The `i`-th row of `X` is "assigned" to a cluster whose ID is `labels[i]`.
These labels will lie in the range 0 to `num_clusters-1`.
    """
    # Step 1: Compute the (truncated) SVD of X: X ~ U S V^T
    from sklearn.decomposition import TruncatedSVD
    svd = TruncatedSVD(n_components=num_components, n_iter=20, random_state=1_234)
    Y = svd.fit_transform(X)
    
    # Step 2: Run K-means on Y, the reduced representation
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=num_clusters, max_iter=100, n_init=1, random_state=5_678).fit(Y)
    labels = kmeans.predict(Y)
    return labels

### Demo function call
with open('resource/asnlib/publicdata/X.dill', 'rb') as fp:
    X = dill.load(fp)
    print(f"Data matrix (`X`) shape: {X.shape[0]:,} x {X.shape[1]:,}")
lsa_labels = lsa_svd_cluster__FREE(X, num_clusters=10)
print(f"Cluster assignments: {lsa_labels}")

print(f"\nDistinctive top tokens for LSA:")
with open('resource/asnlib/publicdata/lsa_tokens.dill', 'rb') as fp:
    lsa_top_tokens, lsa_distinctive = dill.load(fp)
pprint(lsa_distinctive)
Data matrix (`X`) shape: 72,282 x 108,145
Cluster assignments: [8 4 8 ... 2 2 2]

Distinctive top tokens for LSA:
{0: {'images', 'recognition', 'image', 'processing'},
 1: {'language', 'present', 'programming', 'software'},
 2: {'learning', 'algorithms'},
 3: {'services', 'computing', 'service', 'user', 'distributed'},
 4: {'parallel', 'high'},
 5: {'development', 'research', 'study', 'technology'},
 6: {'network', 'simulation', 'mobile', 'networks', 'wireless'},
 7: {'control', 'signal', 'simulation'},
 8: {'theory', 'problems', 'set', 'algorithms', 'number', 'linear'},
 9: {'publisher', 'book', 'technology', 'software', 'guide', 'students', 'web'}}
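
Aside (optional, not graded): a natural follow-up question is how to choose num_components. Since the fitted TruncatedSVD object exposes explained_variance_ratio_, a minimal sketch like the one below, which assumes X is the data matrix loaded in the demo cell above, shows how much of the variance the retained dimensions capture. The name svd_check exists only in this sketch.

from sklearn.decomposition import TruncatedSVD

# Fit a 10-component truncated SVD on the same data matrix used in the demo.
svd_check = TruncatedSVD(n_components=10, n_iter=20, random_state=1_234)
svd_check.fit(X)

# Per-component fraction of variance captured, and the total over all 10 components.
print(svd_check.explained_variance_ratio_)
print(f"Total variance explained by 10 components: {svd_check.explained_variance_ratio_.sum():.3f}")

If the total is very small, increasing num_components (or rethinking the feature weighting) may give the downstream K-means more signal to work with.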


Test cell. The cell below will test your solution for lsa_svd_cluster__FREE (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [46]:
### Test Cell - Exercise 7  


print('Passed! Please submit.')
Passed! Please submit.

Nonnegative Matrix Factorization

For the final exercise, we will try one more clustering method: nonnegative matrix factorization, or NMF.

Recall that the SVD method starts by reducing the dimension of an $n \times m$ data-matrix $X$ to an $n \times d$ matrix $Y$, where $d \ll m$, via the SVD. Although the qualitative results in Exercise 7 appear reasonable, the matrix $Y$ is not easy to interpret because its entries assume arbitrary values. By contrast, the K-means assignment matrix $R$ is directly related to cluster assignments (although the centroids themselves are arbitrary).

In an NMF, we again approximate X by the product of two matrices,

$X \approx W H^T,$
where for some choice of $d < n$, $W$ is $n \times d$ and $H^T$ is $d \times m$ and, importantly, we constrain both $W$ and $H$ to have nonnegative entries, meaning greater than or equal to 0.

Each row of $W$ is now a lower-dimensional representation of its corresponding data point (row) in $X$, like $U_d \Sigma_d$ in the SVD, but because of nonnegativity, its entries can be viewed as weights. That is, $w_{ij}$ measures the strength of association between data point $i$ and column $j$. That in turn implies that if we wish to use $W$ for clustering, we can assign $i$ to the cluster $j$ where $w_{ij}$ is highest.
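
As a small, optional illustration of this factorization (not part of any exercise), here is a minimal sketch that runs scikit-learn's NMF on a tiny random nonnegative matrix and checks that both factors are nonnegative and that their product approximately reconstructs the input. The names X_tiny, W_tiny, and Ht_tiny exist only in this sketch.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_tiny = rng.random((6, 5))                  # 6 "documents" x 5 "tokens", all entries nonnegative

model_tiny = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W_tiny = model_tiny.fit_transform(X_tiny)    # 6 x 2: this is W
Ht_tiny = model_tiny.components_             # 2 x 5: this is H^T in the notation above

print("W >= 0:", (W_tiny >= 0).all(), "| H^T >= 0:", (Ht_tiny >= 0).all())
print("Frobenius reconstruction error:", np.linalg.norm(X_tiny - W_tiny @ Ht_tiny))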

Computing the NMF. The following function calculates the NMF for a given input matrix X:

def calc_nmf_W(X, num_clusters=10):
    from sklearn.decomposition import NMF
    model = NMF(n_components=num_clusters, init='random', random_state=8_917)
    W = model.fit_transform(X)
    return W
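
A brief aside: calc_nmf_W returns only W, but the fitted NMF model also stores the other factor ($H^T$ in the notation above) as model.components_. A variant that returns both factors is sketched below; it is not the function used to precompute W, and it is not needed for the exercise.

def calc_nmf_W_and_Ht(X, num_clusters=10):
    # Variant of calc_nmf_W that also returns the component-to-token factor.
    from sklearn.decomposition import NMF
    model = NMF(n_components=num_clusters, init='random', random_state=8_917)
    W = model.fit_transform(X)   # n x d: document-to-component weights
    Ht = model.components_       # d x m: component-to-token weights (H^T above)
    return W, Ht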

We used this function to pre-calculate a W for X. The code cell below loads this W:

In [47]:
with open('resource/asnlib/publicdata/W.dill', 'rb') as fp:
    W = dill.load(fp)
    
print(W.shape)
(72282, 10)

In the final exercise, you'll extract clusters from a given W.

Ex. 8 (2pts): get_nmf_clusters

Your task: Define get_nmf_clusters as follows:

Determine cluster assignments from an NMF-generated W matrix.

Input: W, an n-by-m matrix produced by an NMF computation.

Return:

  • labels: Cluster assignments: for each row i of W, labels[i] is the cluster (column) assignment.
  • ids: Cluster IDs: a Numpy array containing the values from 0 to m-1.
  • sizes: Cluster sizes: sizes[c] is the number of rows of W that are assigned to cluster c.

Requirements/steps:

  1. Let m be the number of clusters, which is equal to the number of columns of W.
  2. For each row i of W, assign it to the cluster c having the largest value of W[i,c].
  3. The ids array is simply a Numpy array of length m whose values go from 0 to m-1.
  4. For each cluster c, count how many rows were assigned to it and store that as sizes[c].

Example: For the demo code, a correct implementation should return:

(array([1, 0, 2, 0, 2, 1, 3, 0, 0]), array([0, 1, 2, 3]), array([4, 2, 2, 1]))
In [48]:
### Solution - Exercise 8  

def get_nmf_clusters(W: np.ndarray) -> (np.ndarray, np.ndarray, np.ndarray):
    
    # INPUT: W, an n-by-m matrix produced by an NMF computation.
    
    # GOAL: Return 3 1-D arrays:
    # labels: Cluster assignments: for each row i of W, labels[i] is the cluster (column) assignment.
    # ids: Cluster IDs: a Numpy array containing the values from 0 to m-1.
    # sizes: Cluster sizes: sizes[c] is the number of rows of W that are assigned to cluster c.
    
    # STRATEGY:
    # First, build out 'labels':
    # 1. Find column locations of max values for each row in W
    
    # Now, let's build out 'ids':
    # 2. Create 1-D array with values ranging from 0 to m-1
    
    # Finally, build out 'sizes':
    # 3. For each cluster c, count how many rows were assigned to it and store that as sizes[c]
    
    # SOLUTION:
    # Build labels using Numpy argmax:
    # https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
    labels = np.argmax(W, axis=1)
    
    # Build ids and sizes using Numpy unique
    # Google Search: numpy count occurrences of all values
    # https://www.tutorialspoint.com/how-to-count-the-frequency-of-unique-values-in-numpy-array
    ids, sizes = np.unique(labels, return_counts=True)
    
    return labels, ids, sizes
    
    # ---------------------------------------------------------
    
    # Alternative solution not using Numpy unique (using Counter instead):
    W_shape = np.shape(W)
    number_of_clusters = W_shape[1]
    
    ids = np.arange(number_of_clusters)
    labels = np.argmax(W, axis=1)
    
    from collections import Counter
    
    cluster_counter_dict = Counter(labels)
    sizes_list = []
    
    for i in range(number_of_clusters):
        sizes_list.append(cluster_counter_dict[i])
        
    sizes = np.array(sizes_list)
    
    return labels, ids, sizes
    


### Demo function call
with open('resource/asnlib/publicdata/demo_get_nmf_clusters_W.dill', 'rb') as fp:
    demo_W = dill.load(fp)
print("Demo input W:")
pprint(demo_W)
print("\nYour output:")
print(get_nmf_clusters(demo_W))
Demo input W:
array([[0.00000000e+00, 3.37360731e-03, 5.59753284e-05, 0.00000000e+00],
       [7.70853381e-03, 2.13894249e-03, 0.00000000e+00, 2.50804597e-03],
       [0.00000000e+00, 1.11113220e-04, 5.72929128e-03, 0.00000000e+00],
       [1.58384215e-02, 0.00000000e+00, 2.45565407e-03, 2.93780779e-03],
       [0.00000000e+00, 1.67490852e-03, 6.83444761e-03, 3.54015172e-03],
       [2.62927826e-04, 3.43856570e-03, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.41361041e-03, 1.39216635e-03, 3.37744335e-03],
       [3.12936562e-03, 0.00000000e+00, 2.07367238e-03, 2.32231424e-03],
       [2.51065173e-03, 3.12451401e-04, 1.15266037e-03, 0.00000000e+00]])

Your output:
(array([1, 0, 2, 0, 2, 1, 3, 0, 0]), array([0, 1, 2, 3]), array([4, 2, 2, 1]))
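
One small caveat about the np.unique approach, which this demo does not exercise: if some cluster happened to receive no rows at all, np.unique(labels) would omit its ID, so ids would no longer run from 0 to m-1. A defensive sketch that handles empty clusters explicitly, assuming the same kind of n-by-m W as above, could look like this:

def get_nmf_clusters_safe(W):
    # Same idea as get_nmf_clusters, but robust to clusters that receive zero rows.
    m = W.shape[1]
    labels = np.argmax(W, axis=1)               # strongest component for each row
    ids = np.arange(m)                          # always 0 .. m-1, even if a cluster is empty
    sizes = np.bincount(labels, minlength=m)    # empty clusters show up with a count of 0
    return labels, ids, sizes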


Test cell. The cell below will test your solution for get_nmf_clusters (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [49]:
### Test Cell - Exercise 8  

from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load

with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['get_nmf_clusters']['config']

ex_conf['func'] = get_nmf_clusters

tester = Tester(ex_conf, key=b'SGeoIKLnOAbhAQCj3ymEM2J0nfjbk6CylhmBcXTw3s4=', path='resource/asnlib/publicdata/')
for _ in range(75):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
Passed! Please submit.

Fin

If you have made it this far, congratulations on completing the exam.

Don't forget to submit!

Postscript. If you are curious, the NMF method produces these distinctive tokens:

{0: {'computing', 'high', 'applications', 'distributed', 'parallel'},
 1: {'mobile', 'networks', 'applications', 'network', 'wireless'},
 2: {'analysis', 'computational', 'equations', 'model', 'numerical',
     'problems', 'solution'},
 3: {'processing', 'signal', 'error', 'noise'},
 4: {'technology', 'user', 'development', 'research', 'study', 'design'},
 5: {'learning', 'analysis', 'show', 'model'},
 6: {'images', 'analysis', 'recognition', 'show', 'image'},
 7: {'present', 'theory', 'known', 'set', 'number', 'algorithms', 'show'},
 8: {'development', 'language', 'applications', 'software', 'design', 'model'},
 9: {'power', 'technology', 'high', 'control', 'simulation', 'design'}}
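
If you are curious how topic-like token lists can be read directly off the factorization itself (rather than via the count-based pipeline from the earlier exercises), one common trick is to rank tokens by their weights within each row of $H^T$. The sketch below is purely hypothetical: it assumes a Ht matrix like the one returned by model.components_ and a vocab array aligned with the columns of X, neither of which is provided in this notebook, and it is not the code that produced the lists above.

def top_tokens_per_component(Ht, vocab, k=5):
    # For each NMF component (row of H^T), return the k tokens with the largest weights.
    top = {}
    for c, row in enumerate(Ht):
        top_idx = np.argsort(row)[::-1][:k]   # indices of the k largest entries
        top[c] = [vocab[j] for j in top_idx]
    return top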