
COMP90051 Statistical Machine Learning Project 2 Specification



1 Overview

Authorship attribution is the task of identifying the author of a given document. It has been studied extensively in machine learning, natural language processing, linguistics, and privacy research, and is widely applied in real-world settings. Establishing the provenance of historical texts helps place important documents in their historical contexts. A whistleblower may assume that their identity is private when posting online anonymously or pseudonymously, without attaching their real name. Where authorship attribution can be used to reveal their identity, as a kind of privacy attack, it is important to study the possibility of such attacks so that more robust defences can be designed.

Finally, plagiarism detection makes use of authorship attribution, with applications ranging from educational assessment to intellectual property law in business.

Your task:

Your task is to come up with test predictions for an authorship attribution problem given a training set and test inputs.

You will participate as part of a group of students in a Kaggle competition, where you upload your test predictions.

Your mark (detailed below) will be based on your test prediction performance and a short report documenting your solution.

The training data is a set of academic papers published over a period spanning 19 years. The information given for each paper includes the year it was published, the words in the title and abstract, the venue it was published in, and its authors. All discrete values have been replaced with randomly assigned IDs, except the year of publication.

The test data is a list of 800 papers, all published in the year after the training period. Your task is to predict, for each of the test papers, which of a set of 100 prolific authors were involved in writing the paper. The correct answer may be zero, one or many of these authors.

2 Dataset

train.json – contains 25.8k papers. This file is in JSON format as a list of papers, where each paper is a dictionary with keys:

  • authors: a set of the IDs of the authors of the paper, with values in {0,…,21245};
  • year: the year the paper was published, measured in years from the start of the training period;
  • venue: the publication venue (name of journal/conference series), mapped to a unique integer value in {0,…,464}, or “” if there is no specified venue;
  • title: the sequence of words in the paper title, after light preprocessing, where each word has been mapped to an index in {1,…,4999}; and
  • abstract: the sequence of words in the paper abstract, processed as above, using the same word-integer mapping.

Authors with IDs < 100 are the prolific authors, the target of this classification task. Many of the papers in train.json don’t include any prolific authors; you will have to decide whether (and how) to use these instances in training. Note that we include some papers in the test set (described below) which have no prolific authors (no more than 25% of papers), so you will need to be able to handle this situation.
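As a minimal sketch of how the description above might translate into code, the following loads a file of papers and separates the prolific authors (IDs < 100) from the remaining authors. The function names `load_papers`, `split_authors` and `prolific_targets` are hypothetical helpers introduced for illustration; only the JSON keys and the ID threshold come from the specification.

```python
import json
from typing import Dict, List, Tuple

N_PROLIFIC = 100  # per the specification, authors with IDs < 100 are the targets


def load_papers(path: str) -> List[Dict]:
    """Load a JSON file holding a list of paper dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def split_authors(paper: Dict, n_prolific: int = N_PROLIFIC) -> Tuple[List[int], List[int]]:
    """Return (prolific author IDs, other author IDs), each sorted."""
    prolific = sorted(a for a in paper["authors"] if a < n_prolific)
    others = sorted(a for a in paper["authors"] if a >= n_prolific)
    return prolific, others


def prolific_targets(paper: Dict, n_prolific: int = N_PROLIFIC) -> List[int]:
    """Multi-hot target vector: position i is 1 if prolific author i wrote the paper."""
    target = [0] * n_prolific
    for a in paper["authors"]:
        if a < n_prolific:
            target[a] = 1
    return target
```

A paper with no prolific authors yields an all-zero target vector, so such papers can be kept as negative examples or filtered out, depending on the approach you choose.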

test.json – contains 800 papers, stored in JSON format with the fields year, venue, title and abstract as described above, along with two additional items:

  • identifier: the unique identifier of the paper, used to ensure your predictions are aligned correctly in Kaggle; and
  • coauthors: the IDs of the co-authors of the paper, with values in {100,…,21245} (prolific authors with IDs < 100 are excluded). This field may be empty if there are no co-authors.
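Because training papers carry an authors field while test papers carry a coauthors field, features need to be built so that both kinds of paper look alike. The sketch below is one possible way to do this with sparse count features; the function name `paper_features` and the feature-key scheme are assumptions made for illustration, not part of the specification.

```python
from collections import Counter
from typing import Dict


def paper_features(paper: Dict, n_prolific: int = 100) -> Counter:
    """Sparse features keyed by (field, id); works for train and test papers."""
    feats: Counter = Counter()
    for w in paper["title"]:
        feats[("title", w)] += 1      # word counts in the title
    for w in paper["abstract"]:
        feats[("abstract", w)] += 1   # word counts in the abstract
    if paper["venue"] != "":          # "" marks a missing venue
        feats[("venue", paper["venue"])] = 1
    if "coauthors" in paper:          # test papers: prolific authors already removed
        coauthors = paper["coauthors"]
    else:                             # training papers: drop the prolific target authors
        coauthors = [a for a in paper["authors"] if a >= n_prolific]
    for a in coauthors:
        feats[("coauthor", a)] = 1
    return feats
```

Dropping the prolific IDs from the training-side co-author features keeps the labels out of the inputs, which mirrors what is available at test time.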

2.1 Kaggle Submission Format

You will need to submit your predictions on the 800 test papers to Kaggle at least once during the project (but ideally several times). To accomplish this, you will place your 800 predictions in a file of a certain format (described next) and upload this to Kaggle.

If your predictions are {1} for the first test paper, {2,3} for the second test paper, and {} for the third test paper, then below the header row (as given in “sample.csv”) your output file should contain the following rows in CSV format:

    0,1
    1,2 3
    2,-1

Note that the special -1 label is used for the empty set prediction, and that the Id field is the identifier value of the corresponding paper.
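The submission format described above can be produced with a small helper along these lines. This is a sketch under assumptions: the column name "Id" is taken from the text, but the second column name "Predict" is a placeholder that should be checked against “sample.csv”; `format_prediction` and `write_submission` are hypothetical names.

```python
import csv
import io
from typing import Iterable, Sequence, TextIO, Tuple


def format_prediction(authors: Iterable[int]) -> str:
    """Space-separated author IDs, or the special label -1 for the empty set."""
    ids = sorted(set(authors))
    return " ".join(str(a) for a in ids) if ids else "-1"


def write_submission(
    rows: Iterable[Tuple[int, Iterable[int]]],
    out: TextIO,
    header: Sequence[str] = ("Id", "Predict"),  # "Predict" is assumed; check sample.csv
) -> None:
    """Write one CSV row per (identifier, predicted author set) pair."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for identifier, authors in rows:
        writer.writerow([identifier, format_prediction(authors)])


# Usage: the three example predictions {1}, {2,3} and {}.
buffer = io.StringIO()
write_submission([(0, {1}), (1, {2, 3}), (2, set())], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open("submission.csv", "w", newline="")` in place of the in-memory buffer.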

The test set will be used to generate an F1-score for your performance; you may submit test predictions multiple times per day (if you wish). Section 6 describes rules for who may submit: in short, you may only submit to Kaggle as a team, not individually. During the competition, the F1 on a 50% subset of the test set will be used to rank you on the public leaderboard. We will use the other 50% of the test set to determine your final F1 and ranking. The test set is split in this way to discourage you from constructing algorithms that overfit to the leaderboard. The training data “train.json”, the test set “test.json”, and a sample submission file “sample.csv” will be available on the Kaggle competition website. In addition to using the competition test data, and so as to prevent overfitting, we encourage you to generate your own validation data from the training set and to test your algorithms with that validation data as well.
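One way to build such validation data is to hold out the final training year, which mimics the test set being drawn from the year after the training period. The sketch below also scores predictions with a per-paper F1 averaged over papers. Both choices are assumptions: the specification does not state how Kaggle averages F1, and treating an empty set as the single label -1 simply mirrors the submission format. `split_by_year` and `sample_f1` are hypothetical names.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def split_by_year(papers: Sequence[Dict], n_holdout_years: int = 1) -> Tuple[List[Dict], List[Dict]]:
    """Hold out the most recent year(s) of the training data for validation."""
    cutoff = max(p["year"] for p in papers) - n_holdout_years
    train = [p for p in papers if p["year"] <= cutoff]
    valid = [p for p in papers if p["year"] > cutoff]
    return train, valid


def sample_f1(true_sets: Sequence[Iterable[int]], pred_sets: Sequence[Iterable[int]]) -> float:
    """Mean per-paper F1 between true and predicted author sets (assumed metric)."""
    scores = []
    for true, pred in zip(true_sets, pred_sets):
        t = set(true) or {-1}  # empty set is represented by the special label -1
        p = set(pred) or {-1}
        overlap = len(t & p)
        scores.append(2 * overlap / (len(t) + len(p)))
    return sum(scores) / len(scores) if scores else 0.0
```

Under this scoring, correctly predicting that a paper has no prolific authors earns full credit for that paper, while predicting any author for it earns none.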

3 Report

A report describing your approach should be submitted through the Canvas LMS by 5pm Friday 21st October 2022.

It should include the following content:

  1. A brief description of the problem and introduction of any notation that you adopt in the report;
  2. Description of your final approach(es) to authorship attribution, the motivation and reasoning behind it, and why you think it performed well/not well in the competition; and
  3. Any other alternatives you considered and why you chose your final approach over these (this may be in the form of empirical evaluation, but it must support your reasoning; examples like “method A, got F1 0.6 and method B, got F1 0.7, hence I use method B”, with no further explanation, will be marked down).

Your description of the algorithm should be clear and concise. You should write it at a level that a postgraduate student can read and understand without difficulty. If you use any existing algorithms, please do not rewrite the complete description, but provide a summary that shows your understanding and references to the relevant literature. In the report, we will be interested in seeing evidence of your thought processes and reasoning for choosing one algorithm over another.

Dedicate space to describing the features you used and tried, any interesting details about software setup or your experimental pipeline, and any problems you encountered and what you learned. In many cases these issues are at least as important as the learning algorithm, if not more important.

Report format rules. The report should be submitted as a PDF, and be no more than three pages, single column. The font size should be 11pt or above and margins should be at least 1.5cm on all sides, i.e., like this document. If a report is longer than three pages, we will only read and assess the report up to page 3 and ignore further pages. (Don’t waste space on cover pages. References and appendices are included in the page limit; you don’t get extra pages for these. Double-sided pages don’t give you extra pages; we mean the equivalent of three single-sided pages. Three pages means three pages total. Learning how to concisely communicate ideas in short reports is an incredibly important communication skill for industry, government and academia alike.)

4 Submission

The final submission will consist of three parts:

  • By 12 noon Friday 21st October 2022, submitted to the Kaggle competition website: a valid submission to the Kaggle competition. This submission must be of the expected format as described above, and produce a place somewhere on the leaderboard. Invalid submissions do not attract marks for the competition portion of grading (see Section 5).
  • By 5pm Friday 21st October 2022, submitted to the Canvas LMS:
      • A written research report in PDF format (see Section 3).
      • A zip archive of the source code of your authorship attribution algorithm, including any scripts for automation, and a README.txt describing in just a few lines what the files are for. Do not submit data. You may include Slack/Github logs. (We are unlikely to run your code, but we may in order to verify the work is your own, or to settle rare group disputes.)
      • Your Kaggle team name. Without your exact Kaggle team name, we may not be able to credit your Kaggle submission marks, which account for almost half of the project assessment.

The submission link will be visible in the Canvas LMS prior to the deadline.

Note that about a week into Project 2 you will also need to submit a Group Agreement. While not a formal legal contract, completing the agreement together is a helpful way to open up communication within your team and align each other’s expectations.