# CSC246/446 Machine Learning Homework 5: Clustering

## A machine learning assignment from the United States on applying the EM algorithm to model each dataset with a Gaussian mixture

## Objective

You are given two kinds of datasets. Datasets A, B, C, and Z were created by the instructor by sampling from four different Gaussian mixtures, with different numbers of mixture components. Your first task is to apply EM to each of these datasets and to attempt to uncover the "true" number of mixture components. Exactly how to do this is up to you.

The second dataset is the famous Sonar benchmark dataset, which consists of 208 samples of 60 real-valued features. The features correspond to various sonar measurements of mines (bombs) and rocks in the ocean. Although the task is a supervised one, we can also consider applying the EM algorithm to build a two-component GMM out of the dataset (ignoring the rock vs mine label).
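As a rough sketch of getting the Sonar data into arrays, the loader below assumes the UCI file is a headerless CSV whose last column is the label, `R` for rock or `M` for mine. The function name `load_sonar` and the made-up demo rows are illustrative only, so check the layout against the file you actually download.

```python
import csv
import io

import numpy as np


def load_sonar(handle):
    """Parse rows of `f1,...,fd,label` into a feature matrix and 0/1 labels.

    Assumption: no header row, and the last column is 'R' (rock) or 'M' (mine).
    """
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(1 if row[-1].strip() == "M" else 0)
    return np.asarray(features), np.asarray(labels)


# Tiny made-up stand-in (3 features per row) just to exercise the parser.
demo = io.StringIO("0.02,0.37,0.11,R\n0.45,0.08,0.26,M\n")
X_demo, y_demo = load_sonar(demo)
print(X_demo.shape, y_demo)
```

With the real file you would pass an open file handle instead of the `StringIO` object and expect a feature matrix of shape (208, 60). The labels are kept only for evaluation, not for fitting the mixture.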

Your overall objective is to apply the EM algorithm to model each dataset using Gaussian mixtures. For the synthetic datasets, you will have to determine an appropriate number of clusters. (This can be a manual process; you do NOT have to automate that for this assignment.) For the Sonar dataset, the number of clusters should be two. The synthetic datasets are included with the assignment posting on Blackboard; however, you should obtain the Sonar dataset directly from UCI.
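One possible way to narrow down the number of components, offered only as a sketch and not as the required method, is to fit mixtures for a range of candidate values and compare an information criterion such as BIC. The data below is a synthetic stand-in generated with `make_blobs` (an assumption, since the course datasets are not reproduced here), and the candidate range 1 to 6 is arbitrary.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for one of the course datasets (assumption: 2D, 3 clusters).
X, _ = make_blobs(n_samples=600, centers=[[-6, 0], [0, 6], [6, 0]],
                  cluster_std=1.0, random_state=0)

candidate_ks = range(1, 7)
bic_by_k = {}
for k in candidate_ks:
    # n_init=5 keeps the best of five EM runs for each k.
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bic_by_k[k] = gmm.bic(X)  # lower BIC is better

for k, bic in bic_by_k.items():
    print(f"k={k}: BIC={bic:.1f}")

best_k = min(bic_by_k, key=bic_by_k.get)
print("lowest BIC at k =", best_k)
```

Inspecting the printed table by hand is in keeping with the manual process the assignment allows. AIC (`gmm.aic`) or held-out likelihood are alternatives you could compare against.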

## Evaluating Clusterings

Clusterings can be evaluated in two ways: supervised and unsupervised. For supervised evaluation, we assume that we have access to some "true" category labels, and use those to evaluate the quality of the assigned clusters. Two well-known metrics are the Rand index and Mutual Information. For unsupervised evaluation, we can only try to obtain "coherent" clusters, aiming for properties such as having lower within-cluster variance than between-cluster variance. The Silhouette coefficient is a popular unsupervised clustering metric, but there are others.
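The metrics named above are all available in `sklearn.metrics`. The following sketch computes them for a two-component fit on synthetic labeled data (a stand-in, not one of the assignment datasets). It uses the chance-adjusted Rand index and the normalized form of mutual information; the unadjusted `rand_score` and `mutual_info_score` can be used in the same way.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)
from sklearn.mixture import GaussianMixture

# Synthetic stand-in: two well-separated clusters with known labels.
X, y_true = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=1)
labels = GaussianMixture(n_components=2, random_state=1).fit_predict(X)

# Supervised metrics compare cluster assignments with the true labels.
# Both are invariant to how the cluster ids happen to be numbered.
ari = adjusted_rand_score(y_true, labels)
nmi = normalized_mutual_info_score(y_true, labels)

# The unsupervised metric uses only the data and the assignments.
sil = silhouette_score(X, labels)

print(f"ARI={ari:.3f}  NMI={nmi:.3f}  silhouette={sil:.3f}")
```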

Of course, because Gaussian mixtures are probabilistic models, their fits can also be evaluated in terms of model likelihood. (Unfortunately this kind of evaluation cannot be applied to simpler algorithms such as k-means, because k-means does not define a probability model over the data.)
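As a small illustration of likelihood-based evaluation, the sketch below uses `GaussianMixture.score`, which returns the average log-likelihood per sample, and `score_samples`, which returns the per-sample values. The even/odd train and held-out split and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data, split into training and held-out halves.
X, _ = make_blobs(n_samples=400, centers=[[-4, 0], [4, 0]], random_state=2)
X_train, X_held_out = X[::2], X[1::2]

gmm = GaussianMixture(n_components=2, random_state=2).fit(X_train)

avg_train_ll = gmm.score(X_train)        # mean of log p(x) over training samples
avg_held_out_ll = gmm.score(X_held_out)  # same quantity on unseen samples
total_train_ll = gmm.score_samples(X_train).sum()  # sum of per-sample log p(x)

print(f"train: {avg_train_ll:.3f}  held-out: {avg_held_out_ll:.3f}")
```

Comparing the held-out value across models is one way to notice overfitting, since the training likelihood can only improve as components are added.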

## Modeling and Experimentation

As you know, the EM algorithm is guaranteed to converge, but it may converge to a local rather than a global optimum. Therefore you should experiment with various initialization methods. The sklearn implementation supports random initialization and k-means-based initialization, as well as a variant of each.

Similarly, because EM is a maximum likelihood estimator, if one of your cluster centers happens to line up exactly with a sample point, the likelihood can diverge toward a singularity as that component's variance shrinks toward zero. To mitigate this, you can apply regularization. Luckily, the sklearn implementation has a default regularization value, but if you find it is not sufficient, you may need to adjust it.
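In sklearn the regularization is controlled by the `reg_covar` parameter of `GaussianMixture`, which is added to the diagonal of every covariance matrix (default `1e-6`). The sketch below builds a deliberately degenerate stand-in dataset, with one point duplicated 20 times, and shows how the smallest fitted variance is bounded below by `reg_covar`. The data and the two values tried are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
blob = rng.normal(size=(200, 2))               # ordinary cluster at the origin
duplicates = np.tile([[10.0, 10.0]], (20, 1))  # 20 copies of a single point
X = np.vstack([blob, duplicates])

smallest_variance = {}
for reg in (1e-6, 1e-2):
    gmm = GaussianMixture(n_components=2, reg_covar=reg, random_state=4).fit(X)
    # A component sitting on the duplicated point has essentially zero
    # empirical covariance, so its fitted covariance should be close to reg * I.
    smallest_variance[reg] = np.linalg.eigvalsh(gmm.covariances_).min()

for reg, var in smallest_variance.items():
    print(f"reg_covar={reg:g}: smallest covariance eigenvalue={var:.3g}")
```

A larger `reg_covar` makes the fit more stable at the cost of slightly inflating every component, so it is worth raising only as far as needed.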

You should obtain the following results: For each of the datasets (A, B, C, Z, and Sonar), produce plots of likelihood and Silhouette score vs the number of iterations of EM. You can produce combined plots with both functions per plot, or you can produce separate plots.
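`GaussianMixture` does not expose per-iteration metrics directly, so one workable approach (a sketch, not the only option) is to set `max_iter=1` with `warm_start=True`, so that each call to `fit` continues from the previous parameters and performs a single EM iteration. The helper name `trace_em` and the synthetic data are illustrative assumptions.

```python
import warnings

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def trace_em(X, n_components, n_steps=20, seed=0):
    """Run EM one iteration per fit() call and record metrics after each step."""
    gmm = GaussianMixture(n_components=n_components, max_iter=1,
                          warm_start=True, random_state=seed)
    avg_log_likelihood, silhouette = [], []
    for _ in range(n_steps):
        with warnings.catch_warnings():
            # A single-step fit usually stops before convergence; silence that.
            warnings.simplefilter("ignore", ConvergenceWarning)
            gmm.fit(X)
        labels = gmm.predict(X)
        avg_log_likelihood.append(gmm.score(X))
        # Silhouette is undefined when every point lands in one cluster.
        if len(np.unique(labels)) > 1:
            silhouette.append(silhouette_score(X, labels))
        else:
            silhouette.append(np.nan)
    return avg_log_likelihood, silhouette


# Synthetic stand-in for one of the datasets.
X, _ = make_blobs(n_samples=300, centers=3, random_state=5)
log_lik, sil = trace_em(X, n_components=3)
print(log_lik[0], log_lik[-1])
```

The two returned lists can be plotted against `range(1, n_steps + 1)`, either on a shared figure with two y-axes or on separate figures. If true labels are available, a supervised score can be appended inside the same loop.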

For the 2D datasets (A, B, and C), you should produce a color-coded visualization of the clusters. (There are examples of how to do this in the scikit-learn documentation.)
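A minimal version of such a visualization, assuming matplotlib is installed and using synthetic 2D stand-in data, colors each point by its predicted component and marks the fitted means. The output file name `gmm_clusters.png` is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2D stand-in for datasets A, B, or C.
X, _ = make_blobs(n_samples=400, centers=3, random_state=6)
gmm = GaussianMixture(n_components=3, random_state=6).fit(X)
labels = gmm.predict(X)

fig, ax = plt.subplots(figsize=(5, 4))
# Color each point by the component with the highest responsibility.
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=12)
ax.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="x", s=80,
           label="component means")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.legend()
fig.savefig("gmm_clusters.png", dpi=150)
```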

For the Sonar dataset, you should also produce a plot of the mutual information between the inferred clustering and the true labels, together with the Silhouette score, vs the number of iterations of EM.

## Resources

The primary resource for this assignment is the sklearn documentation, along with the UCI page for the Sonar dataset:

• https://scikit-learn.org/stable/modules/clustering.html
• https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
• https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
• https://archive.ics.uci.edu/ml/datasets/connectionist%20bench%20(sonar,%20mines%20vs.%20rocks) 