# CSE 2341 Data Structures Programming Assignment 01

## Introduction

Have you ever read a tweet and thought, “Gee, what a positive outlook!” or “Wow, why so negative, friend?” Can
computers make the same determination? They can surely try!

In Machine Learning, the task of assigning a label to a data item is called classification (putting things into
different classes or categories). The more specific name for what we’re going to do is sentiment analysis
because we’re trying to determine the “sentiment” or attitude based on the words in a tweet. Project 1 is to
build a sentiment classifier! Aren’t you excited?? ( ← That would be positive sentiment!)

You’ll be given a set of tweets called the training data set that have already been classified by humans as
positive or negative based on their content. You’ll analyze the frequency of occurrences of words in the
tweets to develop a classification scheme. Using your classification scheme, you’ll then classify another set of
tweets called the testing data set and predict if each tweet has positive sentiment or negative sentiment.

## Building a Classifier

The goal in classification is to assign a class label to each element of a data set, positive or negative in our
case. Of course, we would want this done with the highest accuracy possible. At a high level, the process to
build a classifier (and many other machine learning models) has two major steps:

### 1. Training Phase

○ Input is a set of tweets with each tweet’s associated sentiment value. This is the training data
set.

○ Example: Assume you have 10 tweets and each is pre-classified with positive or negative
sentiment. How might you go about analyzing the words in these 10 tweets to find words more
commonly associated with negative sentiment and words more commonly associated with
positive sentiment?

○ The result of the training step will be one list of words, each tagged with a positive or
negative sentiment depending on which type of tweet it appeared in more
frequently. Alternatively, you might have 2 lists of words: one positive, one negative.
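
One way to approach the training step described above is to count, for every word, how often it occurs in positive tweets and in negative tweets. The sketch below is only an illustration under stated assumptions: the names `LabeledTweet`, `WordCounts`, `SentimentTable`, and `train` are hypothetical, words are split on whitespace only (no punctuation or case handling), and the standard library containers stand in for whatever data structures you choose to use.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical training example: a tweet's text plus its known sentiment.
struct LabeledTweet {
    bool isPositive;
    std::string text;
};

// Hypothetical per-word tally of occurrences in positive and negative tweets.
struct WordCounts {
    int positive = 0;
    int negative = 0;
};

using SentimentTable = std::map<std::string, WordCounts>;

// Builds a single table mapping each word to its positive/negative counts.
// Assumption: words are separated by whitespace; no other cleanup is done.
SentimentTable train(const std::vector<LabeledTweet>& tweets) {
    SentimentTable table;
    for (const LabeledTweet& tweet : tweets) {
        std::istringstream words(tweet.text);
        std::string word;
        while (words >> word) {
            if (tweet.isPositive) {
                ++table[word].positive;
            } else {
                ++table[word].negative;
            }
        }
    }
    return table;
}
```

A word whose `positive` count exceeds its `negative` count could then be treated as a positive word, and vice versa. Splitting the table into two separate lists is an equally reasonable design.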

### 2. Testing Phase

○ In the testing phase, for a set of tweets called the testing data set, you predict the sentiment
by using the list or lists of words extracted during the training phase.

○ Behind the scenes, you already know the sentiment of each of the tweets in the testing data
set. We’ll call this the actual sentiment or known sentiment. You then compare the
predicted sentiment from the testing phase to the known sentiment for each of the testing
tweets. Some of the predictions will be correct; some will be wrong. The percentage correct is
the accuracy, but more on this later.
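
The testing phase can be sketched as two small functions: one that predicts a tweet's sentiment from a word list, and one that compares predictions to the known sentiment. This is a minimal illustration, not a prescribed method. The names `WordScores`, `predictPositive`, and `accuracy` are hypothetical, the score for each word is assumed to be positive for words seen more often in positive training tweets and negative otherwise, and ties are arbitrarily treated as positive.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical word list from training: score > 0 means the word leaned
// positive in the training data, score < 0 means it leaned negative.
using WordScores = std::map<std::string, int>;

// Sums the scores of the known words in a tweet (split on whitespace).
// Assumption: a total of zero is classified as positive.
bool predictPositive(const WordScores& scores, const std::string& text) {
    int total = 0;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        WordScores::const_iterator it = scores.find(word);
        if (it != scores.end()) {
            total += it->second;
        }
    }
    return total >= 0;
}

// Fraction of predictions that match the known sentiment.
// Returns 0.0 if the inputs are empty or differ in length.
double accuracy(const std::vector<bool>& predicted,
                const std::vector<bool>& actual) {
    if (predicted.empty() || predicted.size() != actual.size()) {
        return 0.0;
    }
    std::size_t correct = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == actual[i]) {
            ++correct;
        }
    }
    return static_cast<double>(correct) / predicted.size();
}
```

For example, if 2 of 4 predictions match the known sentiment, `accuracy` returns 0.5.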

## The Real Data

The data set we will be using in this project comes from real tweets posted around 11-12 years ago. The
data has been preprocessed to put it into the file format we are using for this project. For more information, please see Go, A., Bhayani, R. and
Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford,
1(2009), p.12.

### Input files

There will be 3 different input files:

1. Training Data
2. Testing Data (no sentiment column)
3. Testing Sentiment (id and sentiment for the testing data, for you to compare against)

The training data file is formatted as follows:

● A comma-separated-values (CSV) file containing a list of tweets, each one on a separate line. Each line
of the data file includes the following fields:

○ the sentiment value (negative = 0, positive = 4)
○ the tweet id
○ the date the tweet was posted
○ the query status (you will ignore this column)
○ the Twitter username that posted the tweet
○ the text of the tweet itself

The testing data set is broken into two files:

● A CSV file formatted just like the training data EXCEPT with no Sentiment column

● A CSV file containing the tweet ID and sentiment for the testing dataset (so you can compare your
predictions of sentiment to the actual sentiment, the ground truth)

Below are two example tweets from the training dataset:

4,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,"Beat TCU"
0,1467811595,Mon Apr 06 22:22:03 PDT 2009,NO_QUERY,the_frog,"Beat SMU"
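
A training line like the two above could be split into its fields as follows. This is a hedged sketch, not a required design: `TrainingTweet` and `parseTrainingLine` are hypothetical names, and the code assumes each line has six comma-separated fields as in the examples, with only the final field (the tweet text) allowed to contain commas. Surrounding quotation marks are left in the text as-is.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record for one line of the training file.
struct TrainingTweet {
    int sentiment;      // 0 = negative, 4 = positive
    std::string id;
    std::string date;
    std::string query;  // ignored by the classifier
    std::string user;
    std::string text;   // everything after the fifth comma
};

// Splits on the first five commas only, so commas inside the tweet text
// are preserved. Returns false if the line has too few fields or the
// sentiment value is not "0" or "4".
bool parseTrainingLine(const std::string& line, TrainingTweet& out) {
    std::string fields[5];
    std::size_t start = 0;
    for (int i = 0; i < 5; ++i) {
        std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            return false;
        }
        fields[i] = line.substr(start, comma - start);
        start = comma + 1;
    }
    if (fields[0] != "0" && fields[0] != "4") {
        return false;
    }
    out.sentiment = (fields[0] == "4") ? 4 : 0;
    out.id = fields[1];
    out.date = fields[2];
    out.query = fields[3];
    out.user = fields[4];
    out.text = line.substr(start);
    return true;
}
```

Keeping the tweet id as a string avoids any concern about integer width and makes it easy to write the id back out unchanged later.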

Here is an example tweet from the testing dataset:
1467811596,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,peruna_pony,”SMU > TCU”

The sentiment ﬁle for that testing tweet would be:
4, 1467811596

### Output Files

There will be one output file organized as follows:

● The first line of the output file will contain the accuracy, a single floating point number with exactly 3
decimal places of precision. See the section “How good is your classifier” below to understand
Accuracy.

● The remaining lines of the file will contain the Tweet IDs of the tweets from the testing data set that
your algorithm incorrectly classified. This could be useful information as you tweak your algorithm to
increase efficiency.

Example of the output file for the testing data tweet classifications (these tweet IDs are fake):
0.500
2323232323
1132553423
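
The output format above can be produced with the standard stream manipulators `std::fixed` and `std::setprecision`. The sketch below is one possible way to do it. The function name `writeResults` and its parameters are hypothetical, and it writes to any `std::ostream` so the same code works with a `std::ofstream` opened on the output file.

```cpp
#include <iomanip>
#include <ostream>
#include <string>
#include <vector>

// Writes the accuracy with exactly 3 decimal places on the first line,
// followed by one misclassified tweet ID per line.
void writeResults(std::ostream& out,
                  double accuracy,
                  const std::vector<std::string>& misclassifiedIds) {
    out << std::fixed << std::setprecision(3) << accuracy << '\n';
    for (std::size_t i = 0; i < misclassifiedIds.size(); ++i) {
        out << misclassifiedIds[i] << '\n';
    }
}
```

Called with an accuracy of 0.5 and the two fake IDs shown above, this writes the same three lines as the example.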