Please find the file uspsdata.mat in Blackboard, in the same location where you found this
coursework script. This contains the data that you will analyse. Save it locally, and use the
command load uspsdata.mat in MATLAB to import the file. It contains two matrices: one is the
training set and the other is the test set. Please test that you can import the data as soon as possible, and
contact the TAs if you find that you cannot. The data represent a set of square images and their
labels. Each image is a row of a matrix: the label is in the first column and the pixels are in the subsequent
256 positions, so each image is 16 × 16. The images show handwritten digits, and there are 10 classes (10 different labels).
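
The row layout described above can be illustrated with a minimal sketch. It uses a random stand-in matrix rather than the real file, because the names of the two matrices inside uspsdata.mat are not stated here. The names data, labels, pixels and img are illustrative assumptions, as are the label values 0 to 9.

```matlab
% Stand-in matrix with the layout described above (label first, then 256 pixels).
% With the real file, list its variables first: whos('-file', 'uspsdata.mat')
rng(0);                                        % reproducible random pixels
data = [repmat((0:9)', 4, 1), rand(40, 256)];  % assumed labels 0..9, 4 rows each

labels = data(:, 1);       % class label of each image
pixels = data(:, 2:end);   % one flattened image per row

% 256 = 16 * 16, so each row reshapes to a square image.
% reshape fills column by column; if the digits look rotated or mirrored,
% the file probably stores the image row by row, which the transpose corrects.
img = reshape(pixels(1, :), 16, 16)';

imagesc(img);
colormap(gray);
axis image off;
title(sprintf('label %d', labels(1)));
```

With the real data, replace data with the training or test matrix. Whether the transpose is needed depends on how the pixels were flattened, which is easiest to check by eye.
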
For each question you are expected to report an answer in the form of a figure, plot or table,
as requested. A question may also ask about design choices. Mention the methods you use, but do not
explain them at length unless requested. Cite sources if you use any material external to the course.
Please upload a PDF document containing your answers through the online submission system in
Blackboard by the deadline. Include in the PDF the key lines of MATLAB code that you used to
produce the results.
● Q1 – Download the USPS datasets from Blackboard (file: uspsdata.mat, attached to the same post
where this coursework script was posted). Save the file and import it into MATLAB with load
uspsdata.mat. Report descriptive statistics for each of the two datasets (e.g., size,
dimensions, number of classes, and a histogram or pie chart of the relative size of each class). Visualise
some of the data items, noting that the first entry of each data item is its “class label” and the
remaining 256 entries are the entries of a square matrix. Visualise 4 randomly selected
images using reshape and the MATLAB commands subplot and imagesc, so that each item is
displayed as a square image. (10 marks)
● Q2 – Visualise the data points in two dimensions using either Principal Component Analysis (PCA)
or Multidimensional Scaling (cmdscale). Each item will be a point in the plane. Use a different colour
for each class. (10 marks)
● Q3a – Cluster the training data with k-means, using all 256 pixel dimensions and leaving out the
label column. Choose an appropriate number of clusters. Measure how well the clusters match the
class labels using crosstab (cross-tabulation). Compare the different distance measures available in
kmeans: is one of them better than the others? Note that the centroids can be regarded as data points
themselves: visualise the centroid of each cluster using imagesc and subplot, so that each centroid is
shown as an image. (20 marks)
● Q3b – Discuss why the results of two identical runs may differ. Use crosstab to quantify the
difference between the clusterings produced by two successive runs of k-means. (10 marks)
● Q4 – Consider one specific class of images / vectors (that is, the images that share one class label).
Compute the mean squared distance (MSD) across all pairs of images / vectors within that class.
Compare it with the same quantity measured on a random set of vectors of equal size sampled from
the whole dataset. Generate a histogram showing the distribution of the MSD over many such random
subsets of images, and compare the MSD of the chosen class against this histogram. What can you
conclude from these results? (20 marks)
● Q5 – Using the dedicated MATLAB command fitctree, train (fit) a decision tree to separate class 5
from the other 9 classes. Train on the training set, then test its performance on the test set. Report
a confusion matrix for this 2-class problem. Repeat with a Support Vector Machine (SVM,
fitcsvm). Compare the confusion matrices. (20 marks)
● Q6 – Repeat the above task (Q5) using k-Nearest Neighbour (fitcknn) and compare the results.
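
The two-class setup shared by Q5 and Q6 can be sketched as follows. This is a minimal illustration on random stand-in data, not a model answer. The names makeSet, trainSet, testSet and targetClass are hypothetical, and it assumes that the label value 5 is what denotes "class 5" in the file. It requires the Statistics and Machine Learning Toolbox, which provides fitctree, predict and confusionmat.

```matlab
% Random stand-in with the layout used in this coursework: label in column 1,
% 256 pixel values after it. Replace with the two matrices from uspsdata.mat.
rng(1);                                   % reproducible random pixels
makeSet = @(nPerClass) [repmat((0:9)', nPerClass, 1), rand(10 * nPerClass, 256)];
trainSet = makeSet(30);
testSet = makeSet(10);

targetClass = 5;  % assumption: the label value 5 denotes "class 5"

% One-vs-rest labels: true for the target class, false for the other nine.
yTrain = trainSet(:, 1) == targetClass;
yTest = testSet(:, 1) == targetClass;
XTrain = trainSet(:, 2:end);
XTest = testSet(:, 2:end);

% Fit on the training set only, then predict on the held-out test set.
tree = fitctree(XTrain, yTrain);
yPred = predict(tree, XTest);

% Rows are true classes, columns are predicted classes, ordered [false true].
C = confusionmat(yTest, yPred, 'Order', [false; true]);
disp(C);
```

Because the stand-in pixels are random, the resulting matrix carries no meaning beyond its shape. The same pattern applies with fitcsvm(XTrain, yTrain) or fitcknn(XTrain, yTrain) in place of fitctree, so all three confusion matrices can be compared on identical labels.
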