CS 4417 Assignment 2


Goal: The goal of this assignment is to gain familiarity and practical experience with the
MapReduce programming model and give you experience with text processing.

Programming Language: Python

Evaluation: Part of our evaluation is through testing. Our test cases will NOT be made
available to you before submission. It is your responsibility to test extensively. For all
parts of the assignment, do not concern yourself with stop words or with changing the case of letters.

This assignment is complex, more so than A1; start early, check through the provided
documents/documentation, and make use of the forum. I will be slightly more hands-off to
allow people to work through the complexities of learning a new system and environment.
(Expect responses during reading week to be delayed by 24-72 hours.)

Part 0: VM Setup

The Department of Computer Science has a cluster where Cloudera has been installed. Cloudera
is a company that provides a distribution of Apache Hadoop. There are multiple VMs, each of
which can handle multiple people using it simultaneously.

You can ssh in using your own ID (in place of bdavis56) and by replacing vmX with any of vm1 through vm10.

Mac OS X and Linux terminals should have a built-in SSH and SFTP client. Windows users are
strongly recommended to use WSL2 / Ubuntu for Windows; you can also use PuTTY and WinSCP if
you wish.

The software you need to complete the assignment is on the VM. You do not need to install
anything yourself.

We will be using a container for the VMs. This has been graciously provided by Gary
Molenkamp here at Western. If you do not want to use the VM and have the technical skill
required to run your own container, you may be able to do the assignment locally.
Container link and documentation:

Part 1: Calculate the frequency of a term in each document (20 points)

Given a set of documents, calculate the frequency of each term in each document. The output
should be the term, the document, and the number of occurrences of the term in that document.
This differs from the example presented in the lectures, which focused on a single
document. In this part, the final output should consist of pairs in the following form:
((term, document identifier), count)
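
To make the output format concrete, here is a minimal local sketch of the map and reduce steps in plain Python. It is not a Hadoop job and not the expected submission; the function names, the toy corpus, and the document identifiers `d1` and `d2` are all assumptions made for illustration. Tokens are split on whitespace only, with no stop-word removal or case changes.

```python
from collections import defaultdict


def map_terms(doc_id, text):
    """Map step: emit ((term, doc_id), 1) for every whitespace-separated token."""
    for term in text.split():
        yield (term, doc_id), 1


def reduce_counts(pairs):
    """Reduce step: sum the values that share the same (term, doc_id) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical toy corpus: document identifier -> document text.
docs = {"d1": "the cat saw the dog", "d2": "the dog"}

# Run the map step over every document, then reduce all emitted pairs.
pairs = [pair for doc_id, text in docs.items() for pair in map_terms(doc_id, text)]
term_counts = reduce_counts(pairs)

for key in sorted(term_counts):
    print((key, term_counts[key]))
```

The key design point is that the key is the pair (term, document identifier) rather than the term alone, so counts for the same term in different documents stay separate.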

Submission: You should submit a zip file with the name When unzipped there should
be two files: and

Part 2: Count Bigrams (15 points)

Take the word count example and extend it to count bigrams, that is, sequences of two
consecutive words.
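
As an illustration of what "two consecutive words" means, the sketch below builds and counts bigrams locally in plain Python. It is only a sketch of the idea, not a Hadoop job: the helper names and sample lines are hypothetical, and it assumes that bigrams do not span line boundaries, which the assignment text does not specify.

```python
from collections import Counter


def bigrams(line):
    """Return the list of (word, next_word) pairs found in one line of text."""
    words = line.split()
    # Pair each word with its successor; a line with fewer than two words yields nothing.
    return list(zip(words, words[1:]))


def count_bigrams(lines):
    """Count how often each bigram occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(bigrams(line))
    return counts


# Hypothetical sample input.
sample_lines = ["a b a b", "b c"]
bigram_counts = count_bigrams(sample_lines)

for pair in sorted(bigram_counts):
    print(pair, bigram_counts[pair])
```

Compared with word count, the only change is the key: each emitted key is a pair of adjacent words instead of a single word.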

You should make use of Hadoop for this part.

Submission: You should submit a zip file with the name When unzipped there should
be two files: and

Part 3: Count Unique Bigrams (15 points)

This is an extension of Part 2 where you count the number of unique bigrams. One approach is to
use two MapReduce passes. The first is what you did for Part 2, and the second is something you
need to develop.
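
One way to picture the two-pass idea is sketched below in plain Python. It assumes that "unique" means distinct bigrams, and that the first pass writes one tab-separated line per distinct bigram; both are assumptions, as are the function names and the single constant key `unique_bigrams`. Under those assumptions, the second pass only has to count the lines produced by the first.

```python
def second_pass_map(first_pass_lines):
    """Map step: emit one (constant key, 1) pair per non-empty first-pass line."""
    for line in first_pass_lines:
        if line.strip():
            yield "unique_bigrams", 1


def second_pass_reduce(pairs):
    """Reduce step: sum the values emitted under the single constant key."""
    total = 0
    for _key, value in pairs:
        total += value
    return total


# Hypothetical first-pass output: "bigram<TAB>count", one line per distinct bigram.
first_pass_output = ["a b\t2", "b a\t1", "b c\t1"]
unique_total = second_pass_reduce(second_pass_map(first_pass_output))
print(unique_total)
```

If "unique" were instead read as bigrams that occur exactly once, the map step would need to parse the count field and emit a pair only when it equals 1.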

Submission: You should submit a zip file with the name When unzipped there should
be two files for each MapReduce pass, i: and For example, if i is 1 then
you should have and and if i is 2 then you should have and