You are given two kinds of datasets. Datasets A, B, C, and Z were created by the instructor by sampling from four different Gaussian mixtures, with different numbers of mixture components. Your first task is to apply EM to each of these datasets and to attempt to uncover the "true" number of mixture components. Exactly how to do this is up to you.
The second dataset is the famous Sonar benchmark dataset, which consists of 208 samples of 60 real-valued features. The features correspond to various sonar (link) measurements of mines (bombs) and rocks in the ocean. Although the task is a supervised one, we can also consider applying the EM algorithm to build a two-component GMM out of the dataset (ignoring the rock vs. mine label).
Your overall objective is to apply the EM algorithm to model each dataset using Gaussian mixtures. For the synthetic datasets, you will have to determine an appropriate number of clusters. (This can be a manual process; you do NOT have to automate it for this assignment.) For the Sonar dataset, the number of clusters should be two. The synthetic datasets are included with the assignment posting on Blackboard; however, you should obtain the Sonar dataset directly from UCI.
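One possible manual procedure for choosing the number of components is to fit a mixture for each candidate count and compare a penalized-likelihood criterion such as BIC. The sketch below is only an illustration: the data come from `make_blobs` as a stand-in (not one of the course datasets), and the candidate range and the use of BIC are assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Stand-in data (assumption): three well-separated 2D blobs.
X, _ = make_blobs(n_samples=600, centers=[[-6, 0], [0, 6], [6, 0]],
                  cluster_std=1.0, random_state=0)

candidate_ks = list(range(1, 7))  # hypothetical search range
bics = []
for k in candidate_ks:
    # n_init=3 reruns EM from three initializations and keeps the best fit.
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics.append(gmm.bic(X))  # lower BIC is better

best_k = candidate_ks[int(np.argmin(bics))]
for k, b in zip(candidate_ks, bics):
    print(f"k={k}  BIC={b:.1f}")
print("lowest BIC at k =", best_k)
```

Plotting BIC (or held-out likelihood) against k and looking for the point where improvement levels off is another reasonable way to make the choice by eye.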
Clusterings can be evaluated in two ways: supervised and unsupervised. For supervised evaluation, we assume that we have access to some "true" category labels, and use those to evaluate the quality of the assigned clusters. Two well-known metrics are the Rand index (link) and Mutual Information (link). For unsupervised evaluation, we can only try to obtain "coherent" clusters, aiming for properties such as having lower within-cluster variance than between-cluster variance. The Silhouette coefficient is a popular unsupervised clustering metric (link), but there are others.
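As a rough illustration of the metrics just mentioned, the sketch below computes the Rand index, adjusted Rand index, mutual information, and Silhouette coefficient with functions from `sklearn.metrics`. The data and labels are a synthetic stand-in (an assumption), not one of the assignment datasets.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, mutual_info_score,
                             rand_score, silhouette_score)
from sklearn.mixture import GaussianMixture

# Stand-in data (assumption): two well-separated blobs with known labels.
X, y_true = make_blobs(n_samples=300, centers=[[-5, 0], [5, 0]],
                       cluster_std=1.0, random_state=1)
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Supervised metrics: compare cluster assignments with the true labels.
ri = rand_score(y_true, labels)
ari = adjusted_rand_score(y_true, labels)
mi = mutual_info_score(y_true, labels)

# Unsupervised metric: uses only the data and the assignments.
sil = silhouette_score(X, labels)

print(f"Rand={ri:.3f}  ARI={ari:.3f}  MI={mi:.3f}  Silhouette={sil:.3f}")
```

The supervised metrics here do not depend on which integer is assigned to which cluster, so there is no need to match cluster IDs to class IDs first.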
Of course, being probabilistic models, Gaussian mixture fits can also be evaluated in terms of model likelihood. (Unfortunately this kind of evaluation cannot be applied to simpler algorithms such as k-means, because k-means does not involve any probability.)
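In sklearn, a fitted `GaussianMixture` reports likelihood through `score` (mean log-likelihood per sample) and `score_samples` (log-likelihood of each sample). A minimal sketch on stand-in data (the data are an assumption):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Stand-in data (assumption): two 2D blobs.
X, _ = make_blobs(n_samples=400, centers=[[-4, 0], [4, 0]], random_state=2)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

avg_ll = gmm.score(X)                   # mean log-likelihood per sample
total_ll = gmm.score_samples(X).sum()   # total log-likelihood of the dataset

print(f"average log-likelihood: {avg_ll:.4f}")
print(f"total log-likelihood:   {total_ll:.2f}")
```

Because `score` is an average, it can be compared across datasets of different sizes; multiply by the number of samples to recover the total.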
Modeling and Experimentation
As you know, the EM algorithm is guaranteed to converge, but it may converge to a local rather than global optimum. Therefore you should experiment with various initialization methods. The sklearn methods support random initialization and k-means-based initialization (as well as two variants on those).
Similarly, because EM is a maximum likelihood estimator, if one of your cluster centers happens to line up exactly with a sample point, the model can diverge toward a singularity: the component's variance shrinks toward zero and the likelihood grows without bound. To mitigate this, you can apply regularization. Luckily, the sklearn implementation has a default regularization value, but if you find it is not sufficient, you may need to adjust it.
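In `GaussianMixture`, the regularization is the `reg_covar` parameter (default `1e-6`), which is added to the diagonal of each covariance matrix. The sketch below uses made-up stand-in data containing a clump of identical points, and the particular `reg_covar` values are arbitrary; it shows that the smallest covariance eigenvalue does not fall below `reg_covar`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in data (assumption): a broad cloud plus five identical points,
# which invites a component to collapse onto the clump.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               np.full((5, 2), 3.0)])

min_eigs = {}
for reg in [1e-6, 1e-3, 1e-1]:
    gmm = GaussianMixture(n_components=2, reg_covar=reg, random_state=0).fit(X)
    # Smallest eigenvalue over all component covariance matrices.
    min_eigs[reg] = min(np.linalg.eigvalsh(c).min() for c in gmm.covariances_)
    print(f"reg_covar={reg:g}  min eigenvalue={min_eigs[reg]:.3g}  "
          f"avg log-lik={gmm.score(X):.3f}")
```

Larger values stabilize the fit at the cost of biasing the covariances, so it is worth checking how much the likelihood changes as you increase it.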
You should obtain the following results: For each of the datasets (A, B, C, Z, and Sonar), produce plots of likelihood and Silhouette score vs. number of iterations of EM. You can produce combined plots with both functions per plot, or you can produce separate plots.
For the 2D datasets (A, B, and C) you should produce a color-coded visualization of the clusters. (There are examples of how to do this in the scikit-learn documentation.)
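A color-coded cluster plot can be as simple as a scatter plot colored by the predicted component. The following is a minimal sketch with matplotlib; the data are a synthetic stand-in, and the output filename `gmm_clusters_demo.png` is hypothetical.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Stand-in data (assumption): three 2D blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=5)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)  # hard assignment: most probable component per point

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=12)
ax.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="x", s=80,
           label="component means")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_title("GMM cluster assignments (stand-in data)")
ax.legend()

out_path = Path("gmm_clusters_demo.png")  # hypothetical output file
fig.savefig(out_path, dpi=120)
```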
For the Sonar dataset, you should also produce a plot of the mutual information between the inferred clustering and the true labels, together with the Silhouette score, vs. the number of iterations of EM.
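sklearn does not directly return a score for every EM iteration, so the per-iteration curves requested above need a workaround. One approach, sketched here under assumptions, is to set `max_iter=1` with `warm_start=True` so that each call to `fit` performs one more EM iteration starting from the previous parameters. The data, the labels, and the number of steps are stand-ins, not the Sonar data.

```python
import warnings

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mutual_info_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Stand-in data with known labels (assumption): two 5D blobs.
X, y_true = make_blobs(n_samples=400, centers=2, n_features=5,
                       cluster_std=2.0, random_state=4)

gmm = GaussianMixture(n_components=2, init_params="random", max_iter=1,
                      warm_start=True, random_state=0)
history = {"log_lik": [], "silhouette": [], "mutual_info": []}
n_steps = 30  # hypothetical number of EM iterations to record

with warnings.catch_warnings():
    # Each one-iteration fit warns about non-convergence; that is expected here.
    warnings.simplefilter("ignore", ConvergenceWarning)
    for _ in range(n_steps):
        gmm.fit(X)  # one more EM iteration, resuming from the last parameters
        labels = gmm.predict(X)
        history["log_lik"].append(gmm.score(X))
        history["mutual_info"].append(mutual_info_score(y_true, labels))
        # The Silhouette score is undefined when all points share one label.
        has_two = len(np.unique(labels)) > 1
        history["silhouette"].append(silhouette_score(X, labels) if has_two else np.nan)

print("first/last log-likelihood:", history["log_lik"][0], history["log_lik"][-1])
```

Each list in `history` can then be plotted against the iteration index, either on separate axes or combined in one figure.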
The primary resource for this assignment is the sklearn documentation:
Grading and Submission
Submit a zip file of your code and a PDF writeup with plots/explanations by 11:59pm on the last day of class (April 25th). Your submission should contain the following components:
- 25% – At least four plots (or one really cool one) with likelihood and Silhouette coefficient per iteration for the synthetic data (A, B, C, and Z).
- 15% – Three separate color-coded plots of your best clusterings for low-dimensional synthetic data (A, B, and C).
- 25% – Plots of likelihood, mutual information, and Silhouette score per iteration of EM for Sonar data.
- 35% – Writeup: you should also submit a brief writeup which should describe your experiments and work. In particular, you should describe how you selected the number of clusters for A, B, C, and Z (10%), discuss what you found to be the best schemes for initialization and evaluation (10%), and discuss how well the method worked in recovering the true clusters for the Sonar data (10%). (The remaining 5% is for formatting and editing, e.g., prose style and quality.)