Introduction to Data Science – EMAT20011 Final Assessed Coursework 2020-2021
Please submit by: 1pm Friday May 7th, 2021, via Blackboard.
Please include the basic MATLAB commands used to generate each result.
This document is divided into 2 parts: instructions and tasks.
Please find the file uspsdata.mat on Blackboard, in the same location as this coursework script. It contains the data that you will analyse. Save it locally and import it into MATLAB with the command load uspsdata.mat. The file contains two matrices: a training set and a test set. Please check that you can import the data as soon as possible, and contact the TAs if you cannot. The data represent a set of square images of handwritten digits and their labels, with 10 classes (10 different labels). Each image is one row of a matrix: the first column holds the label and the remaining 256 columns hold the pixel values (a 16×16 image).
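The row layout described above can be illustrated with a minimal sketch. It runs on a synthetic stand-in matrix rather than the real file, because the names of the two matrices stored in uspsdata.mat are not stated here; the variable name data, the label values 0 to 9, and the transpose after reshape are all assumptions made for illustration.

```matlab
% Synthetic stand-in with the layout described above: one image per row,
% label in column 1, 256 pixel values in columns 2..257.
% ASSUMPTION: `data` is a hypothetical name, and the label values 0..9
% are a guess; check the real file after loading it.
rng(0);                                   % reproducible stand-in data
nImages = 20;
data = [randi([0 9], nImages, 1), rand(nImages, 256)];

labels = data(:, 1);                      % nImages-by-1 class labels
pixels = data(:, 2:end);                  % nImages-by-256 pixel values

% Recover the first image as a 16-by-16 array (256 = 16 * 16).
% reshape fills column by column, so the transpose treats each row as
% flattened row by row. ASSUMPTION: the real orientation may differ.
img = reshape(pixels(1, :), 16, 16)';
```

With the real data, the actual variable names can be listed with whos -file uspsdata.mat, and an image such as img can be displayed with imagesc(img). If the digits appear rotated or mirrored, dropping the transpose is the first thing to try.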
For each question you are expected to report an answer in the form of a figure, plot, or table, as requested. Some questions also ask about design choices. Mention the methods used, but do not explain them at length unless requested. Cite your sources if you use any material external to the course.
Please upload a PDF document containing your answers through the online submission system in Blackboard by the deadline. Include in the PDF the key lines of MATLAB that you used to produce the results.
Please state your name and 7-digit student number (not candidate number) at the top of your report; the student number can be found on your card. This is an individual coursework; you are expected to work alone.
This assessed coursework is based on the examples and lab practice of the course and is worth 100% of the unit. Please submit your work as a PDF document containing all text and results of your work. Use MATLAB to complete the tasks described below. Keep your answers short.
- Q1. Download the USPS datasets from Blackboard (file: uspsdata.mat, attached to the same post as this coursework script). Save the file and import it into MATLAB with load uspsdata.mat. Report descriptive statistics for each of the two datasets (e.g., size, dimensions, number of classes, and a histogram or pie chart of the relative sizes of the classes). Visualise some of the data items, noting that the first entry of each data item is its “class label” and the remaining 256 entries are the entries of a square matrix. Visualise 4 randomly selected images using the MATLAB commands reshape, subplot, and imagesc; each item should be displayed as a square image. (10 marks)
- Q2. Visualise the data points in two dimensions using either Principal Component Analysis (PCA) or Multidimensional Scaling (cmdscale). Each item will be a point in this space. Use a different colour for each class. (10 marks)
- Q3a. Cluster the training data (full-dimensional, with the labels removed) using k-means, choosing an appropriate number of clusters. Measure how well the clusters match the class labels using crosstab (cross-tabulation). Compare different measures of similarity in kmeans: is one of them better than the others? Notice that the centroids can be regarded as data points themselves: visualise the centroid of each cluster using imagesc and subplot, so that each centroid is shown as an image. (20 marks)
- Q3b. Discuss why the answers may differ between two identical runs. Use crosstab to quantify the difference between the clusterings produced by two successive runs of k-means. (10 marks)
- Q4. Consider one specific class of images/vectors (that is, images with the same class label). Compute the mean squared distance (MSD) across all pairs of images/vectors within that class. Compare it with the same quantity measured on a random set of vectors of equal size sampled from the whole dataset. Generate a histogram showing the distribution of the MSD over random subsets of images, and compare this histogram with the MSD of the chosen class. What can you conclude from these results? (20 marks)
- Q5. Using the dedicated MATLAB command (fitctree), train (fit) a decision tree to separate class 5 from the other 9 classes. Train on the training set, then test its performance on the test set. Report a confusion matrix for this 2-class problem. Repeat with a Support Vector Machine (SVM, fitcsvm). Compare the confusion matrices. (20 marks)
- Q6. Repeat the above task (Q5) using k-Nearest Neighbour (fitcknn) and compare the results. (10 marks)
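As a hedged illustration of the quantity in Q4, the sketch below computes a mean squared distance over all unordered pairs of rows, and repeats it on random subsets of the same size. It assumes squared Euclidean distance and distinct pairs (i < j), which the task does not state explicitly. The data are a synthetic stand-in, and the names allPixels, labels, pairSq, and msdOf are hypothetical.

```matlab
% ASSUMPTION: MSD = mean of squared Euclidean distances over all
% unordered pairs of distinct rows. Stand-in data, not the USPS file.
rng(1);
allPixels = rand(300, 256);               % hypothetical pixel matrix
labels    = repmat((0:9)', 30, 1);        % hypothetical labels, 30 per class

% n-by-n matrix of squared distances via ||a||^2 + ||b||^2 - 2*a.b
% (implicit expansion needs R2016b or later); max(...,0) clips round-off.
pairSq = @(X) max(sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X'), 0);
% Average the strict upper triangle, i.e. each pair counted once.
msdOf  = @(X) sum(sum(triu(pairSq(X), 1))) / nchoosek(size(X, 1), 2);

classRows = allPixels(labels == 5, :);    % rows of one chosen class
n         = size(classRows, 1);
msdClass  = msdOf(classRows);

% MSD of random subsets of the same size, drawn without replacement.
nTrials   = 100;
msdRandom = zeros(nTrials, 1);
for t = 1:nTrials
    idx = randperm(size(allPixels, 1), n);
    msdRandom(t) = msdOf(allPixels(idx, :));
end
```

The values in msdRandom can then be passed to histogram and compared with msdClass. The stand-in data have no class structure, so here the two are expected to be similar; any difference on the real data is what the question asks you to interpret. If the Statistics and Machine Learning Toolbox is available, mean(pdist(X).^2) should give the same quantity as msdOf(X).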
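The one-versus-rest setup in Q5 and Q6 can be sketched as follows, again on synthetic stand-in data. It assumes that class 5 is stored as the label value 5, which should be checked against the real file, and it requires the Statistics and Machine Learning Toolbox. All variable names are hypothetical, and the shift added to the class-5 rows exists only so that the stand-in has something to learn.

```matlab
% Stand-in data: labels 0..9, 256 features, with class-5 rows shifted so
% the toy problem is learnable. ASSUMPTION: class 5 <-> label value 5.
rng(2);
trainLabels = repmat((0:9)', 20, 1);      % 200 hypothetical training labels
testLabels  = repmat((0:9)', 10, 1);      % 100 hypothetical test labels
trainPixels = rand(200, 256) + (trainLabels == 5);
testPixels  = rand(100, 256) + (testLabels  == 5);

% Binary targets: true for class 5, false for the other 9 classes.
yTrain = trainLabels == 5;
yTest  = testLabels  == 5;

% Fit on the training set only, then predict on the held-out test set.
tree  = fitctree(trainPixels, yTrain);
yPred = predict(tree, testPixels);

% 2-by-2 confusion matrix: rows are true classes, columns are predicted
% classes, both in sorted order (false, then true).
C = confusionmat(yTest, yPred);
```

fitcsvm and fitcknn accept the same (predictors, labels) arguments and their models work with predict in the same way, so the last two steps can be reused unchanged when comparing classifiers. Because class 5 is much smaller than the other nine classes combined, the off-diagonal entries of C are usually more informative than overall accuracy.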
Please submit your report as a PDF file, including figures and the key MATLAB commands used to generate the results.