INSTRUCTIONS

Please find the file uspsdata.mat in Blackboard, in the same location where you found this
coursework script. It contains the data that you will analyse. Save it locally and import it into
MATLAB with the command load uspsdata.mat. The file contains two matrices: one is the training set
and the other is the test set. Please check that you can import the data as soon as possible, and
contact the TAs if you cannot. The data represent a set of square images and their labels. Each
image is a row of a matrix: the label is in the first column and the pixels are in the remaining
256 columns. The images are handwritten digits, and there are 10 classes (10 different labels).
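The row layout described above (label first, then 256 pixel values) can be separated as in the minimal sketch below. The variable names inside uspsdata.mat are not given in this script, so the sketch builds a synthetic stand-in matrix called data; that name, and the assumption that labels run from 1 to 10, are illustrative only.

```matlab
% Synthetic stand-in for one of the two matrices in uspsdata.mat.
% ASSUMPTION: the real variable names are unknown; "data" is hypothetical.
rng(1);                                            % reproducible random numbers
nItems = 50;
data = [randi(10, nItems, 1), rand(nItems, 256)];  % [label, 256 pixels] per row

labels = data(:, 1);       % first column: class label
pixels = data(:, 2:end);   % remaining 256 columns: pixel values

fprintf('%d items, %d pixels each, %d distinct labels\n', ...
    size(pixels, 1), size(pixels, 2), numel(unique(labels)));
```

Once the real file has been downloaded, whos('-file', 'uspsdata.mat') lists the names and sizes of the variables it contains without loading them.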

For each question you are expected to report an answer in the form of a figure, a plot or a table,
as requested. A question may also ask about design choices. Mention the methods you used, but do
not explain them at length unless requested. Cite your sources if you use any material external to
the course.

Please upload a PDF document containing your answers through the online submission system in
Blackboard by the deadline. Include in the PDF the key lines of MATLAB code that you used to
produce the results.

TASKS:

● Q1 - Download the USPS dataset from Blackboard (file uspsdata.mat, attached to the same post as
this coursework script). Save it and import it into MATLAB with load uspsdata.mat. Report
descriptive statistics for each of the two datasets (e.g. size, dimensions, number of classes, and a
histogram or pie chart of the relative size of each class). Visualise some of the data items, noting
that the first entry of each item is its class label and the remaining 256 entries are the entries
of a square (16 × 16) matrix. Visualise 4 randomly selected images using the MATLAB commands
reshape, subplot and imagesc, so that each item is displayed as a square image. (10 marks)
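As a rough illustration of the Q1 workflow, the sketch below counts the items per class and displays four random rows as 16 × 16 images. It runs on synthetic data with the same layout as the USPS matrices; the name data, the labels 1 to 10, and the transpose after reshape (which depends on whether pixels were stored row by row) are all assumptions.

```matlab
% Synthetic stand-in with the USPS layout: [label, 256 pixels] per row.
% ASSUMPTION: "data" and the label range 1..10 are hypothetical.
rng(2);
nItems = 300;
labels = randi(10, nItems, 1);
protos = rand(10, 256);                                  % one fake prototype per class
data   = [labels, protos(labels, :) + 0.1 * randn(nItems, 256)];

% Descriptive statistics: size and number of items per class.
[nRows, nCols] = size(data);
classes = unique(data(:, 1));
counts  = arrayfun(@(c) sum(data(:, 1) == c), classes);
fprintf('%d rows, %d columns, %d classes\n', nRows, nCols, numel(classes));

figure;
bar(classes, counts);
xlabel('class label'); ylabel('number of items');

% Four randomly selected items shown as square images.
pick = randperm(nRows, 4);
figure;
for k = 1:4
    % reshape fills column by column; the transpose assumes row-wise storage.
    img = reshape(data(pick(k), 2:end), 16, 16)';
    subplot(2, 2, k);
    imagesc(img);
    colormap(gray); axis image off;
    title(sprintf('label %d', data(pick(k), 1)));
end
```

If the digits appear rotated or mirrored on the real data, dropping the transpose is the first thing to try.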

● Q2 - Visualise the data points in two dimensions using either Principal Component Analysis (PCA,
pca) or classical Multidimensional Scaling (cmdscale). Each item becomes a point in the plane. Use
a different colour for each class. (10 marks)
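One possible way to produce the two-dimensional view asked for in Q2 is sketched below, using pca and, as an alternative, cmdscale on a Euclidean distance matrix. Both functions, together with pdist and gscatter, come from the Statistics and Machine Learning Toolbox. The data are synthetic and the variable names are hypothetical.

```matlab
% Synthetic stand-in for the pixel matrix and labels (ASSUMPTION: hypothetical data).
rng(3);
nItems = 300;
labels = randi(10, nItems, 1);
protos = rand(10, 256);
pixels = protos(labels, :) + 0.1 * randn(nItems, 256);

% Option 1: PCA. Columns of "score" are the data projected on the components.
[~, score] = pca(pixels);

% Option 2: classical MDS on pairwise Euclidean distances.
D = squareform(pdist(pixels));   % full nItems-by-nItems distance matrix
Y = cmdscale(D);                 % columns ordered by decreasing eigenvalue

colours = hsv(10);               % one RGB colour per class
figure;
subplot(1, 2, 1);
gscatter(score(:, 1), score(:, 2), labels, colours);
xlabel('PC 1'); ylabel('PC 2'); title('PCA');
subplot(1, 2, 2);
gscatter(Y(:, 1), Y(:, 2), labels, colours);
xlabel('MDS 1'); ylabel('MDS 2'); title('cmdscale');
```

With Euclidean distances the two embeddings should agree up to the sign of each axis, so in practice either one is enough.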

● Q3a - Cluster the training data in its full dimensionality, with the labels removed, using k-means
(kmeans). Choose an appropriate number of clusters and measure how well the clusters match the
class labels by cross-tabulating them with crosstab. Compare the different distance measures
available in kmeans: is one of them better than the others? Notice that the centroids can be
regarded as data points themselves: visualise the centroid of each cluster using imagesc and
subplot, so that each centroid is shown as an image. (20 marks)
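The steps in Q3a could look roughly like the sketch below: cluster with two of the distance options that kmeans offers, cross-tabulate clusters against labels, and draw the centroids. The choice of k = 10, the purity score used to summarise each table, and the synthetic data are assumptions made for illustration, not part of the brief.

```matlab
% Synthetic stand-in for the training pixels and labels (ASSUMPTION: hypothetical data).
rng(4);
nItems = 300;
labels = randi(10, nItems, 1);
protos = rand(10, 256);
pixels = protos(labels, :) + 0.1 * randn(nItems, 256);

k = 10;   % ASSUMPTION: one cluster per digit class; other values are worth trying
[idxEuc, C] = kmeans(pixels, k, 'Distance', 'sqeuclidean', 'Replicates', 5);
idxCos      = kmeans(pixels, k, 'Distance', 'cosine',      'Replicates', 5);

% Rows are clusters, columns are class labels.
tabEuc = crosstab(idxEuc, labels);
tabCos = crosstab(idxCos, labels);

% Purity: fraction of items that fall in the majority class of their cluster.
% This summary is an illustrative choice, not something the brief prescribes.
purity = @(t) sum(max(t, [], 2)) / sum(t(:));
fprintf('purity sqeuclidean: %.3f, cosine: %.3f\n', purity(tabEuc), purity(tabCos));

% Each centroid is a 256-vector, so it can be shown as a 16x16 image.
figure;
for c = 1:k
    subplot(2, 5, c);
    imagesc(reshape(C(c, :), 16, 16)');   % transpose assumes row-wise pixel storage
    colormap(gray); axis image off;
    title(sprintf('cluster %d', c));
end
```

The other 'Distance' values accepted by kmeans, such as 'cityblock' and 'correlation', can be compared in the same way.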

● Q3b - Discuss why two identical runs of k-means may give different answers. Use crosstab to
quantify the difference between the clusterings produced by two successive runs. (10 marks)
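Because kmeans chooses its initial centroids at random, two calls with identical arguments can converge to different partitions. The sketch below makes that visible by changing the random seed between two runs and cross-tabulating the results; the synthetic data, the seeds, and the agreement score are assumptions for illustration.

```matlab
% Synthetic stand-in for the training pixels (ASSUMPTION: hypothetical data).
rng(5);
nItems = 300;
labels = randi(10, nItems, 1);
protos = rand(10, 256);
pixels = protos(labels, :) + 0.3 * randn(nItems, 256);

k = 10;
rng(10); idxA = kmeans(pixels, k);   % run 1
rng(20); idxB = kmeans(pixels, k);   % run 2: same call, different random start

% Rows: clusters of run 1. Columns: clusters of run 2.
tab = crosstab(idxA, idxB);
disp(tab);

% Cluster numbers are arbitrary, so compare partitions rather than labels:
% for each run-1 cluster, count the items in its best-matching run-2 cluster.
agreement = sum(max(tab, [], 2)) / numel(idxA);
fprintf('agreement between runs: %.3f\n', agreement);
```

A table with exactly one non-zero entry per row and per column means the two runs found the same partition and differ only in how the clusters are numbered.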

● Q4 - Consider one specific class of images / vectors (that is, images with the same class label).
Compute the mean squared distance (MSD) across all pairs of images / vectors within that class.
Compare it with the same quantity measured on a random set of vectors of equal size sampled from
the whole dataset. Generate a histogram showing the distribution of the MSD over random subsets of
images, and compare it with the MSD of the chosen class. What can you conclude from these results?
(20 marks)
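A minimal sketch of the Q4 comparison follows. It takes the MSD of a set to be the mean of the squared Euclidean distances over all distinct pairs, which is one reading of the definition above; the chosen class, the number of random subsets, and the synthetic data are assumptions.

```matlab
% Synthetic stand-in for the pixels and labels (ASSUMPTION: hypothetical data).
rng(6);
nItems = 300;
labels = randi(10, nItems, 1);
protos = rand(10, 256);
pixels = protos(labels, :) + 0.1 * randn(nItems, 256);

% MSD over all distinct pairs of rows in X (squared Euclidean distance).
msd = @(X) mean(pdist(X, 'squaredeuclidean'));

target   = 5;                              % ASSUMPTION: any class could be chosen
inClass  = pixels(labels == target, :);
m        = size(inClass, 1);               % subset size to match
msdClass = msd(inClass);

% Distribution of the MSD over random subsets of the same size.
nTrials = 200;                             % ASSUMPTION: number of random subsets
msdRand = zeros(nTrials, 1);
for t = 1:nTrials
    pick = randperm(nItems, m);            % m rows sampled without replacement
    msdRand(t) = msd(pixels(pick, :));
end

figure;
histogram(msdRand);
hold on;
plot([msdClass msdClass], ylim, 'r', 'LineWidth', 2);   % MSD of the chosen class
hold off;
xlabel('mean squared distance'); ylabel('number of random subsets');
legend('random subsets', sprintf('class %d', target));
```

The fraction of random subsets whose MSD is at or below msdClass, mean(msdRand <= msdClass), gives an empirical measure of how unusual the within-class value is.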

● Q5 - Using the dedicated MATLAB command (fitctree), train (fit) a decision tree to separate
class 5 from the other 9 classes. Train on the training set, then test its performance on the test
set. Report a confusion matrix for this 2-class problem. Repeat with a Support Vector Machine (SVM,
fitcsvm) and compare the two confusion matrices. (20 marks)
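The Q5 pipeline might be sketched as below: recode the labels as "class 5 or not", fit each model on the training set, and tabulate predictions on the test set with confusionmat. The data are synthetic, the variable names are hypothetical, and the linear kernel with standardised inputs is just one reasonable starting point for fitcsvm.

```matlab
% Synthetic stand-ins for the training and test sets (ASSUMPTION: hypothetical data).
rng(7);
protos = rand(10, 256);
yTrain = randi(10, 400, 1);
XTrain = protos(yTrain, :) + 0.3 * randn(400, 256);
yTest  = randi(10, 200, 1);
XTest  = protos(yTest, :) + 0.3 * randn(200, 256);

% Two-class problem: true for class 5, false for the other nine classes.
isFiveTrain = (yTrain == 5);
isFiveTest  = (yTest == 5);

% Decision tree.
tree   = fitctree(XTrain, isFiveTrain);
cmTree = confusionmat(isFiveTest, predict(tree, XTest));

% Support vector machine (ASSUMPTION: linear kernel, standardised features).
svm   = fitcsvm(XTrain, isFiveTrain, 'KernelFunction', 'linear', 'Standardize', true);
cmSvm = confusionmat(isFiveTest, predict(svm, XTest));

% Rows: true class (false, true). Columns: predicted class (false, true).
disp('Decision tree:'); disp(cmTree);
disp('SVM:');           disp(cmSvm);
```

Only about one item in ten belongs to class 5, so overall accuracy can look high even for a poor model; the second row of each confusion matrix shows how many true 5s were actually found.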

● Q6 - Repeat the above task (Q5) using k-Nearest Neighbours (fitcknn) and compare the results.
(10 marks)
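For Q6, the same two-class setup can be reused with fitcknn, as in the sketch below. The synthetic data and the value of 'NumNeighbors' are assumptions; the number of neighbours is a design choice worth varying and reporting.

```matlab
% Synthetic stand-ins for the training and test sets (ASSUMPTION: hypothetical data).
rng(8);
protos = rand(10, 256);
yTrain = randi(10, 400, 1);
XTrain = protos(yTrain, :) + 0.3 * randn(400, 256);
yTest  = randi(10, 200, 1);
XTest  = protos(yTest, :) + 0.3 * randn(200, 256);

isFiveTrain = (yTrain == 5);   % true for class 5, false otherwise
isFiveTest  = (yTest == 5);

% k-nearest-neighbour classifier (ASSUMPTION: 5 neighbours, default Euclidean distance).
knn   = fitcknn(XTrain, isFiveTrain, 'NumNeighbors', 5);
cmKnn = confusionmat(isFiveTest, predict(knn, XTest));

% Rows: true class (false, true). Columns: predicted class (false, true).
disp('k-NN:'); disp(cmKnn);
accuracy = sum(diag(cmKnn)) / sum(cmKnn(:));
fprintf('k-NN test accuracy: %.3f\n', accuracy);
```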