This is an individual assessment.
In this assessment you will demonstrate your overall understanding
of the topics covered in FIT5196.
Task 1: Data Cleansing (25%)
For this task, you are required to write Python (Python 3) code to analyze your dataset, then find and fix the problems in the data. The input and output of this task are shown below:
Table 1. The input and output of task 1
Exploring and understanding the data is one of the most important parts of the data wrangling process. You are required to perform graphical and/or non-graphical EDA methods to understand the data first and then find and fix the data problems. You are required to:
- Detect and fix errors in <student_id>_dirty_data.csv
- Impute the missing values in <student_id>_dirty_data.csv
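A minimal non-graphical EDA sketch, assuming the dirty data is loaded with pandas. The in-memory frame below is purely illustrative (column names follow Table 2; the invented values stand in for <student_id>_dirty_data.csv):

```python
import pandas as pd

# Hypothetical sample standing in for <student_id>_dirty_data.csv
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [29, 31, None, 45],  # one missing value to impute
    "Sector": ["IT", "Finance", "it", "Data Science"],  # inconsistent casing
})

# Non-graphical EDA: summary statistics, missing-value counts,
# and category frequencies to spot inconsistent labels
print(df.describe(include="all"))
print(df.isna().sum())
print(df["Sector"].value_counts())
```

Frequency tables like `value_counts()` are often enough to expose syntactic anomalies (e.g., the same category spelled two ways), while `isna()` locates the values to impute.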
As a starting point, here is what we know about the dataset in hand:
The dataset (you can find it here) contains information about users (aka employees) of a particular company located in Melbourne, Australia. Each instance of the data represents a single user from said company. The description of each data column is shown in Table 2.
Table 2. Description of the columns
ID: A unique ID for each user
Latlng: The latitude and longitude information of the user
Title: The title of the user
Age: The age of the user
Gender: A boolean attribute representing the gender of the user
Work_experience: A numerical attribute representing the number of years of work experience
Last_employment_date: A datetime attribute representing the date of the user's last employment
Education: A boolean attribute representing whether the user has an undergrad (0) or postgrad (1) degree
Sector: A categorical attribute representing the sector of each user (i.e., IT, Finance, or Data Science)
Salary: A numerical attribute representing the salary of the user
- You must not delete any rows/columns of the input data frame or change any of the column names. Any misspelling or mismatch may result in a mark of zero for the output.
- There is at least one anomaly in the dataset from each category of data anomalies (i.e., syntactic, semantic, and coverage).
- In the file <student_id>_dirty_data.csv, each row carries at most one anomaly, and every anomaly is fixable (if there is no possible way to fix something, it is not an anomaly).
- There are no outliers in the input data, so you don’t need to identify and remove outliers.
- The company uses different business rules, depending on the employees’ sectors, to calculate their salaries. However, we know that the salary is calculated by a linear function which differs per sector. The model depends linearly (but in different ways for each sector) on:
- The years of work experience of the user
- The user gender
- The user education level
- The user age
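One way to exploit this rule is to fit a separate linear model per sector and flag rows whose recorded salary disagrees with the fitted value. A sketch with synthetic data and NumPy least squares; the coefficients below are invented for illustration, not the real business rules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented coefficients for one hypothetical sector:
# salary = b0 + b1*experience + b2*gender + b3*education + b4*age
true_beta = np.array([40000.0, 2500.0, 1000.0, 8000.0, 300.0])

n = 200
X = np.column_stack([
    np.ones(n),                  # intercept
    rng.integers(0, 30, n),      # work experience (years)
    rng.integers(0, 2, n),       # gender (boolean)
    rng.integers(0, 2, n),       # education (0 undergrad / 1 postgrad)
    rng.integers(21, 65, n),     # age
]).astype(float)
salary = X @ true_beta           # exact linear rule, as the spec states

# Corrupt one row's salary to play the role of a semantic anomaly
salary_dirty = salary.copy()
salary_dirty[17] += 12345.0

# With an exact linear rule, clean rows have ~zero residuals after a
# least-squares fit, so the largest residual pinpoints the corrupted row
beta_hat, *_ = np.linalg.lstsq(X, salary_dirty, rcond=None)
residuals = salary_dirty - X @ beta_hat
suspect = int(np.argmax(np.abs(residuals)))
print(suspect)  # → 17
```

Because each sector has its own coefficients, the fit must be done per sector group; the flagged salary can then be repaired by replacing it with the model's prediction.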
- We know that the following attributes are always correct (i.e. don’t look for any errors in dirty data for them):
- As EDA is part of this assessment, no further information will be given publicly regarding the data. However, you can brainstorm with the teaching team.
Task 2: Parsing and Pre-processing Textual Data (30%)
This task touches on the first steps of analyzing textual data, i.e., extracting data from semi-structured text files and performing required pre-processing steps for the purpose of quantifying the textual data. Below is the list of inputs and outputs of this task:
Table 3. The input and output of task 2
Each student is provided with a dataset that contains information about users’ tweets. The users are the same users as in Task 1. The text file contains information about the tweets, i.e., “id” and “text”. Your task here is three-fold:
- First, you need to extract the data from <student_number>_tweets.txt and transform it into a JSON file named <student_id>_tweets.json with the same structure as the sample output file (i.e., tweets.json, which can be found here).
a. You are only allowed to use the “re” package as an external package for this step. All other steps, including writing the JSON file, must be performed using only built-in Python functionality.
b. Hint: we recommend exploring the input text file first to come up with an efficient regular expression.
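A sketch of the extraction step, assuming (purely for illustration) that each tweet in the text file looks like `"id":"123","text":"..."` on its own line; the real file's layout may differ, which is exactly why the hint above recommends inspecting it first. Only `re` and built-in functionality are used:

```python
import re
import json

# Hypothetical raw content standing in for <student_number>_tweets.txt;
# the actual delimiters/quoting in your file may differ.
raw = '''"id":"101","text":"Loving the new office!"
"id":"102","text":"Deadline moved again..."'''

# Capture the id and text of each tweet (adapt to the real layout)
pattern = re.compile(r'"id":"(\d+)","text":"(.*?)"')
tweets = [{"id": tid, "text": text} for tid, text in pattern.findall(raw)]

# Write the JSON output using only the built-in json module
output = json.dumps({"data": tweets}, indent=2)
print(output)
```

The surrounding structure of the JSON output (here a single `"data"` key) is an assumption; match it to the provided tweets.json sample instead.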
- Then, you need to add a column named “Positive_tweets_no” to the <student_id>_dirty_data_solution.csv you created in Task 1. The value of this column is the number of positive tweets sent by each user in the <student_number>_tweets.txt file.
a. Note that a particular user may never have sent any tweets; in that case the user’s ID does not appear in the <student_number>_tweets.txt file, and the value of “Positive_tweets_no” for that user must be zero.
b. To check whether a tweet is positive, the tweets are classified using a sentiment analysis classifier. SentimentIntensityAnalyzer from nltk.sentiment.vader is used to obtain the polarity score. A tweet is considered positive if it has a ‘compound’ polarity score of 0.05 or higher, and negative otherwise. Refer to this link for more details on how to use this module.
c. Note that if, for example, user A sent out 20 tweets in total and 7 of these tweets have a compound score of 0.05 or higher, then you need to put 7 in this column for user A.
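The counting logic can be separated from the classifier itself. In the sketch below the compound scores are hard-coded for illustration; in the real task each score would come from `SentimentIntensityAnalyzer().polarity_scores(text)['compound']`, and users absent from the tweets file default to zero:

```python
from collections import Counter

# Hypothetical (user_id, compound_score) pairs; in the assignment these
# come from running VADER over each tweet's text.
scored = [
    ("101", 0.62), ("101", -0.30), ("101", 0.05),  # user 101: 2 positive
    ("102", -0.10),                                 # user 102: 0 positive
]
all_user_ids = ["101", "102", "103"]                # 103 never tweeted

# Count tweets meeting the 0.05 'compound' threshold per user,
# defaulting to 0 for users with no (positive) tweets
positive = Counter(uid for uid, compound in scored if compound >= 0.05)
positive_tweets_no = {uid: positive.get(uid, 0) for uid in all_user_ids}
print(positive_tweets_no)  # → {'101': 2, '102': 0, '103': 0}
```

Note the threshold is inclusive (`>= 0.05`), matching point b above, which is why the tweet scored exactly 0.05 counts as positive.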
- In the last step, you need to create a file named <student_id>_100uni.txt with three lines. Each line must contain the top 100 most frequent uni-grams for one Sector (i.e., IT, Finance, and Data Science). To create the set of unigrams, the following operations must be performed, but not necessarily in this order (you need to figure out the correct order of operations yourself):
a. The word tokenization must use the following regular expression,
b. The context-independent stop words must be removed from the vocab. The provided context-independent stop-word list (i.e., stopwords_en.txt) must be used.
c. Tokens must be stemmed using the Porter stemmer.
d. Case normalisation must be applied to all tokens.
e. Tokens with a length of less than 3 must be removed from the vocab.
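A stdlib-only sketch of one plausible ordering of steps a–e (choosing and justifying the correct order is your job). The token pattern and stop-word set here are placeholders for the assignment's own regex and stopwords_en.txt, and the `stem` stub marks where nltk's `PorterStemmer` would slot in:

```python
import re
from collections import Counter

# Placeholders: the assignment specifies its own tokeniser regex and
# provides stopwords_en.txt.
TOKEN_RE = re.compile(r"[a-zA-Z]+")
STOPWORDS = {"the", "and", "for", "with"}

def stem(token):
    # Stand-in for nltk.stem.PorterStemmer().stem(token)
    return token

def top_unigrams(texts, n=100):
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text)                      # a. tokenise
        tokens = [t.lower() for t in tokens]                 # d. case-normalise
        tokens = [t for t in tokens if t not in STOPWORDS]   # b. stop words
        tokens = [stem(t) for t in tokens]                   # c. stem
        tokens = [t for t in tokens if len(t) >= 3]          # e. length filter
        counts.update(tokens)
    return [w for w, _ in counts.most_common(n)]

# Toy texts standing in for one sector's tweets
sector_texts = ["Shipping the new model today", "The model training finished"]
top5 = top_unigrams(sector_texts, n=5)
print(top5)
```

Run once per sector and write each sector's top-100 list on its own line of <student_id>_100uni.txt. Note that the interaction between steps matters: for example, stop-word removal behaves differently before versus after case normalisation or stemming, which is what the ordering question is probing.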