This assignment has been designed to help students develop basic skills in data visualization and to allow students to practice techniques learned in lecture and tutorial.
Key Admin Information
a. ONE written report (Word or PDF format, through Canvas - Assignment 1 Report Submission).
b. SEVERAL Python “.py” or Jupyter Notebook “.ipynb” files (through Canvas - Assignment 1 – Upload Your Program Code Files).
2. The late penalty for the assignment is 5% of the assigned mark per calendar day, starting after 4:00 pm on the due date. The closing date, Monday 11 April 2022, 4:00 pm, is the last date on which an assessment will be accepted for marking.
3. Length: The main text of your report (everything except any appendices) should be at most 10 pages in a normal 12-point font with single line spacing. For each Task, you should write a sufficient and complete report, with the necessary plots, covering your visualization, methodology, analysis, insights and limitations, etc., where possible.
4. Numbers with decimals should be reported to four decimal places in the report.
5. If you wish to include additional material, you can do so by creating an appendix. There is no page limit for the appendix. Keep in mind that making good use of your audience’s time is an essential business skill. Every sentence, table or figure has to count. Extraneous and/or wrong material will potentially affect your mark.
6. Anonymous marking: Given the anonymous marking policy of the University, please only include your student ID (SID) in the submitted report, and do NOT include your name. The file name of your report should follow this format, with “XXXX” replaced by your SID, for example: QBUS6860_2021S1_SIDXXXXX.pdf
7. Presentation of the assignment is part of the assessment. Markers will assign up to 10% of the marks for clarity of writing and presentation.
8. For Turnitin to check your code, please copy and paste your code into the Appendix. Code should be formatted in a fixed-width font such as Courier New or Consolas.
If your programs are in “.py” files, simply copy and paste them into the report Appendix. If you are using Jupyter Notebook, please follow InstructionPY to convert it to “.py”.
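As a rough illustration of what converting a notebook to “.py” involves, the sketch below pulls the code cells out of a notebook's JSON using only the standard library. It is not the InstructionPY procedure, whose content is not reproduced here; the function name `notebook_to_script` and the sample notebook are made up for illustration. In practice the standard command `jupyter nbconvert --to script` does the same job more robustly.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Return the code cells of a notebook (given as JSON text) as one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"


# Tiny made-up notebook, for illustration only
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import math\n", "print(math.pi)"]},
        {"cell_type": "code", "source": "x = 1"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

script = notebook_to_script(sample)
print(script)
```

For a real file, read the “.ipynb” file as text, pass it to the function, and paste the returned string into the Appendix.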
- Carefully read the requirements for each part of the assignment.
- Please follow any further instructions announced on Canvas.
- You must use Python for the assignment.
- Reproducibility is fundamental in data analysis, so make sure you submit the correct Python “.py” or Jupyter Notebook “.ipynb” files that generate the results in your report. Markers will run your programs to check them.
- The University of Sydney takes plagiarism very SERIOUSLY. Please be warned that plagiarism between individuals/groups is always obvious to the markers and can be easily detected by Turnitin.
- Not submitting your code will lead to a loss of 50% of the assignment marks.
- Failure to read information and follow instructions may lead to a loss of marks. Furthermore, note that it is your responsibility to be informed of the University of Sydney and Business School rules and guidelines, and follow them.
- Referencing: Business School recommends APA Referencing System. (You may find the details at: https://libguides.library.usyd.edu.au/citation/apa7 )
- Feedback will be provided on the marked submission.
Task A (40 Marks)
This task is designed for you to practice your skills in conducting basic Visual Data Analytics (VDA) and Exploratory Data Analysis (EDA).
The COVID-19 pandemic in Australia is part of the ongoing worldwide pandemic of the coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first confirmed case in Australia was identified on 25 January 2020, in Victoria (from https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Australia). Since then the Australian Federal Government has collected data on the COVID-19 pandemic in Australia. The data is useful to agencies of all types when making public policy decisions.
https://www.covid19data.com.au/ is a place to get up-to-date COVID-19 data for Australia, sourced from the Australian Federal Government (https://www.health.gov.au/health-alerts/covid-19) and State Government health agencies. You can download the dataset as described in https://www.covid19data.com.au/data-notes, or from Matt Bolton’s GitHub repository https://github.com/M3IT/COVID-19_Data/tree/master/Data
A copy of the dataset has been placed on Canvas for your convenience, but you are encouraged to download the most recently updated data directly from the above GitHub site. The data files are all in csv format. It is easy to identify the meaning of each column in each file.
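A minimal sketch of reading one of the csv files and computing a few descriptive statistics, reported to four decimal places, might look as follows. The sample rows are made up, and the column names `date` and `confirmed` are assumptions; check the header of the file you actually download. Only the standard library is used here, and `pandas.read_csv` would serve equally well.

```python
import csv
import io
import statistics

# Made-up sample in the shape of a daily national file.
# The column names "date" and "confirmed" are assumptions.
SAMPLE_CSV = """date,confirmed
2022-03-01,120
2022-03-02,135
2022-03-03,
2022-03-04,150
"""


def daily_counts(csv_text, column):
    """Return the non-missing values of `column` as floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader if row[column].strip() != ""]


def summarise(values):
    """Descriptive statistics as strings, with decimals shown to four places."""
    stats = {
        "total": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }
    summary = {"count": str(len(values))}
    summary.update({name: f"{value:.4f}" for name, value in stats.items()})
    return summary


summary = summarise(daily_counts(SAMPLE_CSV, "confirmed"))
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

To use a downloaded file instead of the sample, read its text with `open(path, newline="").read()` and pass it to `daily_counts`.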
You have been assigned two visualisation types at random (e.g., your randomly selected types could be violin and scatterplot, or histogram and bubble plot, etc.). Please check the list file QBUS6860_Assignment01_RandomTask.xlsx for your assigned visualisation types by using your Student ID. This file is on Canvas along with this document.
- [8 Marks] Play with all the dataset files, and report and explain the statistics you find, such as the total number of positive COVID-19 cases so far.
- [12 Marks] Use your two randomly assigned visualisation types to analyse the data (you may use other types in addition to the types you are assigned, but you must use your assigned types). For example, suppose you were assigned histogram and bubble plot but you think that the data could be better represented using a stream graph. You may use a stream graph in addition to histogram and bubble plot, but you must use at least histogram and bubble plot in your analysis. If an assigned type is not appropriate for this set of data, please explain the reason.
Always keep in mind that the visual presentation should be meaningful and visually pleasing.
- [10 Marks] Conduct appropriate analysis and report your insights. You should consider this task challenging.
- [5 Marks] Summarise your conclusions, for example on whether the data is of good quality and what other information could be collected, and put forward your suggestions.
Note: The other 5 marks are allocated for presentation quality.
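When judging whether the data is of good quality, a few mechanical checks can be a starting point. The sketch below counts missing values, negative daily values and gaps in the date sequence. The records are made up, and the function name `quality_report` is hypothetical; which checks matter for the real files is a judgment you need to make yourself.

```python
from datetime import date, timedelta

# Made-up daily records: (ISO date, new cases, or None when missing)
records = [
    ("2022-03-01", 120),
    ("2022-03-02", None),
    ("2022-03-04", -5),
    ("2022-03-05", 150),
]


def quality_report(rows):
    """Count missing values, negative values and gaps in the date sequence."""
    days = sorted(date.fromisoformat(day) for day, _ in rows)
    present = set(days)
    gaps = []
    current = days[0]
    while current <= days[-1]:
        if current not in present:
            gaps.append(current.isoformat())
        current += timedelta(days=1)
    return {
        "missing_values": sum(1 for _, value in rows if value is None),
        "negative_values": sum(
            1 for _, value in rows if value is not None and value < 0
        ),
        "missing_dates": gaps,
    }


report = quality_report(records)
print(report)
```

A negative daily value is not necessarily an error (it may reflect a later correction to cumulative totals), so findings like these are prompts for discussion rather than automatic fixes.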
Task B (60 Marks)
Find the affiliation(s) and email address(es) of ICLR2022 (https://iclr.cc/Conferences/2022/) authors from the OpenReview site https://openreview.net/group?id=ICLR.cc/2022/Conference. This task is designed for you to apply techniques in data management and EDA.
The International Conference on Learning Representations (ICLR) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence called representation learning, but generally referred to as deep learning. ICLR is globally renowned for presenting and publishing cutting-edge research on all aspects of deep learning used in the fields of artificial intelligence, statistics and data science, as well as important application areas such as machine/computer vision, computational biology, speech recognition, text understanding, gaming, and robotics.
You may reuse parts of the tutorial code and revise them for your purpose here.
- [5 Marks] Acquire the ICLR2022 author IDs, each of which is either an OpenReview ID or an author’s email address. You may rely on some code snippets from Tutorial 3.
- [12 Marks: Challenging] Write Python code to extract all the author profiles. As shown in Tutorial 3, each author has an ID (or email address) on the OpenReview site. You need to get the IDs for all ICLR2022 authors. An author profile can then be accessed at a URL like https://openreview.net/profile?id=~Junbin_Gao1 where ~Junbin_Gao1 is the author ID (username). On a sample page, locate where the author affiliation and email address are, then write your own web crawler to get this information for all ICLR2022 authors.
Warning: be prepared to wait for all the information to be retrieved after you deploy your crawler.
- [12 Marks: Challenging] Explore and report some statistics, such as the total number of authors, how many values are missing for affiliations or emails, how many different affiliations there are, where the authors are from, etc.
Note 1: Generally speaking, each appearance of an author ID means a paper submission. It is therefore possible to tell how many papers an author submitted and how many papers came from a particular organisation.
Note 2: OpenReview captures all the emails for the organisations with which an author is and/or was associated. I suggest you use the first email address in an author’s email list to identify their current affiliation.
Note 3: As no country information is collected in the author profile, you may need to rely on the email domain to map an author to a country. For example, from sydney.edu.au we know that au is the code for Australia. But people may use common email domains such as [‘gmail.com’, ‘qq.com’, ‘126.com’, ‘163.com’, ‘outlook.com’, ‘hotmail.com’, ‘yahoo.com’, ‘foxmail.com’, ‘aol.com’, ‘msn.com’, ‘ymail.com’, ‘googlemail.com’, ‘live.com’]. In this case, please adopt the following strategy: (1) if such a common email address appears as an author’s first email address, check the second email address to identify the country; (2) if such a common email address is the only email address for the author, you may aggregate these authors in a group of “unidentifiable”.
- [8 Marks] Visually present the statistical information you have discovered in Task 3.
- [10 Marks] Identify or discuss whether there is any missing information in Task 2. What is your suggestion regarding this?
- [8 Marks] (Challenging!) Segment authors into three major groups: University, IT Company (e.g., Google, Tencent, etc.), and Others.
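One possible starting point for the three-way segmentation is a rule-based pass over email domains, sketched below. Everything here is an assumption rather than a prescribed method: the `COMPANY_DOMAINS` set is a tiny illustrative list, and the academic rule (an `edu` or `ac` label in the last two positions of the domain) will miss universities whose domains follow neither pattern, such as ethz.ch.

```python
from collections import Counter

# Illustrative and incomplete list of company domains (an assumption)
COMPANY_DOMAINS = {"google.com", "tencent.com", "microsoft.com"}

# Domain labels that commonly mark academic institutions (heuristic only)
ACADEMIC_LABELS = {"edu", "ac"}


def segment(domain):
    """Classify an email domain as University, IT Company or Others."""
    domain = domain.strip().lower()
    labels = domain.split(".")
    # e.g. mit.edu, sydney.edu.au, cs.ox.ac.uk
    if ACADEMIC_LABELS & set(labels[-2:]):
        return "University"
    if any(domain == c or domain.endswith("." + c) for c in COMPANY_DOMAINS):
        return "IT Company"
    return "Others"


# Made-up domains for illustration
domains = ["sydney.edu.au", "cs.ox.ac.uk", "research.google.com", "ethz.ch"]
groups = Counter(segment(d) for d in domains)
print(groups)
```

The ethz.ch example lands in Others, which shows why a purely rule-based pass usually needs a manually curated list of exceptions on top.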
Note: The other 5 marks are allocated for presentation quality.
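The email strategy in Notes 2 and 3 of Task B can be sketched as a small function. The common-domain list is taken from Note 3; the `CCTLD_TO_COUNTRY` table is a tiny illustrative subset, and the labels “missing” and “unmapped” are assumptions added here for authors with no email and for domains without a recognised country code. Treating two common addresses in a row as “unidentifiable” is also an assumption, since Note 3 does not cover that case.

```python
COMMON_DOMAINS = {
    "gmail.com", "qq.com", "126.com", "163.com", "outlook.com",
    "hotmail.com", "yahoo.com", "foxmail.com", "aol.com", "msn.com",
    "ymail.com", "googlemail.com", "live.com",
}

# Small illustrative subset of country-code top-level domains (an assumption)
CCTLD_TO_COUNTRY = {
    "au": "Australia",
    "cn": "China",
    "uk": "United Kingdom",
    "de": "Germany",
}


def email_domain(email):
    """Return the lower-cased part of an email address after the last '@'."""
    return email.rsplit("@", 1)[-1].strip().lower()


def country_for(emails):
    """Map an author's ordered email list to a country label."""
    if not emails:
        return "missing"  # no email recorded at all (label is an assumption)
    domains = [email_domain(e) for e in emails[:2]]
    if domains[0] in COMMON_DOMAINS:
        domains = domains[1:]  # Note 3 (1): fall back to the second address
    if not domains or domains[0] in COMMON_DOMAINS:
        return "unidentifiable"  # Note 3 (2), or both addresses are common
    tld = domains[0].rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(tld, "unmapped")


# Made-up addresses for illustration
print(country_for(["someone@sydney.edu.au"]))
print(country_for(["someone@gmail.com", "someone@tsinghua.edu.cn"]))
print(country_for(["someone@gmail.com"]))
```

Domains such as mit.edu or google.com carry no country code, so they come back as “unmapped” and need a separate lookup or manual treatment.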