Goal: The goal of this assignment is to gain familiarity and practical experience with the MapReduce programming model and give you experience with text processing.
Programming Language: Python
Evaluation: Part of our evaluation is through testing. Our test cases will NOT be made available to you before submission. It is your responsibility to test extensively. For the different parts of the assignments do not concern yourself with stop words or changing the case of letters
This assignment is complex, more so than A1; start early, check through the provided documents/documentation, and make use of the forum. I will be slightly more hands-off to allow people to work through the complexities of learning a new system and environment.
(Expect responses during reading week to be delayed by 24-72 hours).
Part 0: VM Setup
The Department of Computer Science has a cluster where Cloudera has been installed. Cloudera is a company that provides Apache Hadoop. There are multiple VMs which can handle multiple people using them simultaneously.
You can ssh in using your ID (replace bdavis56) and by changing vmX to vm1 through to vm10 ssh bdavis56@cs4417-vmX.gaul.csd.uwo.ca
Mac OS X & Linux terminals should have a built in SSH and SFTP client. Windows users are strongly recommended to use WSL2 / Ubuntu for Windows; you can use putty and WinSCP if you wish as well.
The software you need to complete the assignment is on the VM. You do not need to install anything.
We will be using a container for the VMs. This has been graciously provided by Gary Molenkamp here at Western. If you do not want to use the VM and have the technical skill required to run your own container, you may be able to do the assignment locally.
Container link and documentation:
Part 1: Calculate the frequency of a term in each document (20 points)
Given a set of documents, calculate the frequency of a term in each document. The output should be the term, document and number of occurrences of the term in the document. This is different from the example presented in the lectures in that the example focused on one document. In this example, the final output should consist of pairs in the following form:
((term, document identifier), count)
Submission: You should submit a zip file with the name Part1.zip. When unzipped there should be two files: mapper.py and reducer.py.
Part 2: Count Bigrams (15 points)
Take the word count example and extend it to count bigrams which refers to sequences of two consecutive words.
You should make use of Hadoop for this part.
Submission: You should submit a zip file with the name Part2.zip. When unzipped there should be two files: mapper.py and reducer.py.
Part 3: Count Unique Bigrams (15 points)
This is an extension of part 2 where you count the number of unique bigrams. One approach is to use two MapReduce passes. The first is what you did for Part 2 and the second is something you need to develop.
Submission: You should submit a zip file with the name Part3.zip. When unzipped there should be two files for each MapReduce pass, i: mapperi.py and reduceri.py. For example, if i is 1 then you should have mapper1.py and reducer1.py and if i is 2 then you should have mapper2.py and reducer2.py.