
COMP809: Data Mining & Machine Learning Assignment 1


This assignment gives you an opportunity to solve two real-world data mining problems using the machine learning workbench. In both questions below, the justification of your answers carries a high proportion of the marks awarded. You are required to conduct experiments for both case studies and report them according to the specified requirements.


Study Area I (Boat Dataset)

This case study is concerned with a decision-making process that helps a customer evaluate the condition of a boat. The dataset contains 6 attributes, and each record is assigned one of 4 class labels describing the boat's condition. You can find more information about the dataset and its attributes in the metadata file under Assignment 1 materials.

You are required to build a model using the Decision Tree Classifier and answer the following questions based on the model built. Use the segment of the data whose outcomes are known. In building the model, use the 10-fold cross validation option for testing.

Your answers below need to be supported by suitable evidence, wherever appropriate. Some examples of suitable evidence are Confusion Matrices, Model Visualizations and Summary Statistics.
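As a rough illustration of the testing setup described above, the sketch below runs a decision tree under 10-fold cross validation. It assumes scikit-learn is the workbench in use, and because the Boat dataset is not reproduced here it substitutes a synthetic dataset with 6 attributes and 4 classes; the sample size and random seeds are arbitrary choices, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Boat dataset (assumption): 6 attributes, 4 classes.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold; each fold is used exactly once for testing.
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the spread across folds alongside the mean gives some evidence of how stable the model is, which can support the justification the questions ask for.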


  1. a) Describe the pre-processing you have performed to prepare your data. [5 marks]


  1. b) Using an appropriate method, identify the most influential features in classifying this dataset. Explain the process of the chosen feature selection method and state the number of features selected for your dataset. [5 marks]
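One possible feature selection method for categorical attributes is ranking by information gain, the same quantity an entropy-based decision tree uses when choosing splits. The sketch below computes it by hand with the standard library only. The attribute names (`price`, `safety`), the class labels and the six rows are invented for illustration and are not taken from the Boat dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy after splitting on one categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy rows invented for illustration: (price, safety, condition).
rows = [
    ("low", "high", "good"), ("low", "low", "bad"),
    ("high", "high", "good"), ("high", "low", "bad"),
    ("low", "high", "good"), ("high", "low", "bad"),
]
labels = [row[2] for row in rows]
gains = {
    "price": information_gain([row[0] for row in rows], labels),
    "safety": information_gain([row[1] for row in rows], labels),
}

# Attributes sorted from most to least informative.
ranked = sorted(gains, key=gains.get, reverse=True)
for name in ranked:
    print(f"{name}: {gains[name]:.3f} bits")
```

In this toy data `safety` separates the classes perfectly while `price` barely helps, so a cut-off on the ranked gains would keep the former and drop the latter.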


  1. c) Perform initial data exploration by analysing the summary statistics of the selected variables. Identify the variance for each variable and describe their distribution using appropriate plot(s). [5 marks]
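A minimal sketch of the summary statistics step, using only the standard library. The two numeric columns and their values are hypothetical; if the real attributes are categorical they would first need an encoding (for example ordinal codes), which is an assumption to justify in the report.

```python
import statistics

# Hypothetical numeric columns, invented for illustration only.
columns = {
    "length_m": [4.2, 5.1, 6.8, 5.5, 7.9, 6.1],
    "age_years": [2, 15, 7, 30, 1, 12],
}

summaries = {}
for name, values in columns.items():
    summaries[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "min": min(values),
        "max": max(values),
    }
    print(name, {key: round(value, 2) for key, value in summaries[name].items()})
```

A large gap between mean and median, as in the `age_years` column here, hints at a skewed distribution that a histogram or box plot would make visible.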


  1. d) Now build a model using the Decision Tree algorithm. By adjusting two suitable parameters (one at a time), reduce the size of the tree to no more than 10 to 15 nodes in order to improve the interpretability of the model generated. Which of the two parameters yielded better accuracy while producing smaller trees? Analyse your findings and discuss the results. Visualise the final generated decision tree and describe it. [10 marks]
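The experiment in d) could be organised along the following lines. This is a sketch under assumptions: scikit-learn is the workbench, the data is a synthetic stand-in for the Boat dataset, and `max_depth` and `max_leaf_nodes` are picked here as the two parameters (other choices such as `min_samples_leaf` or `ccp_alpha` are equally defensible). The `evaluate` helper is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Boat dataset (assumption): 6 attributes, 4 classes.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

def evaluate(**params):
    """Return (node count of a tree fitted on all data, mean 10-fold accuracy)."""
    tree = DecisionTreeClassifier(random_state=0, **params)
    accuracy = cross_val_score(tree, X, y, cv=10).mean()
    node_count = tree.fit(X, y).tree_.node_count
    return node_count, accuracy

# Vary one parameter at a time, leaving the other at its default.
results = {}
for depth in (2, 3, 4):
    results[("max_depth", depth)] = evaluate(max_depth=depth)
for leaves in (6, 7, 8):
    results[("max_leaf_nodes", leaves)] = evaluate(max_leaf_nodes=leaves)

for (param, value), (nodes, accuracy) in results.items():
    print(f"{param}={value}: {nodes} nodes, accuracy {accuracy:.3f}")
```

A binary tree of depth 3 has at most 15 nodes, and a tree with 8 leaves has at most 15 nodes, so both settings fall inside the size limit. The final tree can then be drawn with `sklearn.tree.plot_tree` or printed with `sklearn.tree.export_text`.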


  1. e) Describe the role of the two parameters in the model building that you used in d) above. Do you expect that manipulating these parameters in the same way will improve accuracy for other types of datasets? Justify your answer. [5 marks]


  1. f) Provide and carefully examine the Confusion Matrix. Is there any significant finding with regard to your selected variables? If yes, why do you think this happens? [5 marks]
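To make explicit what a confusion matrix records, the sketch below builds one by hand with the standard library. The three class names and the eight predictions are invented for illustration; the real dataset has 4 classes. In scikit-learn the equivalent call is `sklearn.metrics.confusion_matrix`.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in classes]
            for actual in classes]

# Hypothetical labels and predictions, for illustration only.
classes = ["poor", "fair", "good"]
y_true = ["poor", "poor", "fair", "fair", "fair", "good", "good", "poor"]
y_pred = ["poor", "fair", "fair", "fair", "good", "good", "good", "poor"]

matrix = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>5}: {row}")

# Off-diagonal cells show which pairs of classes the model confuses.
errors = {(classes[i], classes[j]): matrix[i][j]
          for i in range(len(classes)) for j in range(len(classes))
          if i != j and matrix[i][j] > 0}
print("misclassifications (actual, predicted):", errors)
```

Errors that cluster between neighbouring classes, as in this toy example, are the kind of pattern worth relating back to the selected variables.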


  1. g) Generate and provide a classification report, showing precision, recall, F1 and support scores, to evaluate your model performance. Describe your findings. [5 marks]
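The four scores in g) can all be derived from a confusion matrix, as the hand-rolled sketch below shows. The matrix and class names are hypothetical, and `per_class_report` is an illustrative helper; in scikit-learn, `sklearn.metrics.classification_report` produces the same per-class columns.

```python
def per_class_report(matrix, classes):
    """Precision, recall, F1 and support per class, from a confusion matrix
    whose rows are actual classes and columns are predicted classes."""
    report = {}
    for i, name in enumerate(classes):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        support = sum(matrix[i])                   # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": support}
    return report

# Hypothetical 3-class confusion matrix, for illustration only.
classes = ["poor", "fair", "good"]
matrix = [[2, 1, 0],
          [0, 2, 1],
          [0, 0, 2]]

report = per_class_report(matrix, classes)
for name, scores in report.items():
    print(f"{name:>5}: precision={scores['precision']:.2f} "
          f"recall={scores['recall']:.2f} f1={scores['f1']:.2f} "
          f"support={scores['support']}")
```

Precision reads down a column (how many predictions of a class were right), while recall reads along a row (how many actual members of a class were found); the support column shows how much data each score rests on.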


Study Area II (Breast Cancer Diagnostic Dataset)

Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases, and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area. The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). You can find more information about this dataset and its attributes in the metadata file under Assignment 1 materials.


For this dataset, you will use both the Decision Tree classifier and the Naïve Bayes (NB) algorithm to build predictive models for these tumors. For both methods, use the 10-fold cross validation option for testing.


  1. a) Perform Exploratory Data Analysis (EDA) and describe your dataset. Explain any pre-processing and data manipulation tasks you performed to prepare your dataset for feature modelling. Present and discuss your findings in detail using both tabular and graphical formats. Note: no marks will be given for plots or tables presented without explanation. [10 marks]


  1. b) Use an appropriate method of feature selection to identify significant features. State the method used, list the features produced, and explain why this feature reduction method was used. Discuss the independence assumption between the features in the Naïve Bayes (NB) algorithm and support your answer with reference to the selected features. [15 marks]
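One way to examine the independence assumption is to look at pairwise correlations between the selected features. The sketch below assumes the copy of the Wisconsin Diagnostic dataset bundled with scikit-learn (`load_breast_cancer`) matches the assignment's file, which may not hold; the four feature names and the 0.9 threshold are illustrative choices, not the result of any particular selection method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled dataset stands in for the assignment file.
data = load_breast_cancer()
names = list(data.feature_names)

# Hypothetical selection, chosen only to illustrate the check.
selected = ["mean radius", "mean perimeter", "mean area", "mean smoothness"]
columns = [names.index(name) for name in selected]

# Pearson correlation between every pair of selected features.
corr = np.corrcoef(data.data[:, columns], rowvar=False)

for i in range(len(selected)):
    for j in range(i + 1, len(selected)):
        flag = "  <- strongly correlated" if abs(corr[i, j]) > 0.9 else ""
        print(f"{selected[i]} vs {selected[j]}: {corr[i, j]:+.3f}{flag}")
```

Radius, perimeter and area are geometrically tied to one another, so they are far from independent. Naïve Bayes treats each as separate evidence, which tends to make its probability estimates overconfident even when the predicted class is still correct.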


  1. c) Run the Naïve Bayes algorithm with the GaussianNB implementation on the selected features. Provide the metrics to evaluate the performance of the NB model and discuss the results. [15 marks]
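A sketch of how c) might be run with scikit-learn's `GaussianNB` under 10-fold cross validation. It again assumes the bundled `load_breast_cancer` data matches the assignment's file, and the four selected features are a hypothetical outcome of part (b), used only to make the example runnable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled dataset stands in for the assignment file.
data = load_breast_cancer()
names = list(data.feature_names)

# Hypothetical feature subset standing in for the output of part (b).
selected = ["mean radius", "mean texture", "mean smoothness", "mean concavity"]
X = data.data[:, [names.index(name) for name in selected]]
y = data.target  # in this copy of the data: 0 = malignant, 1 = benign

# Every sample is predicted once, by a model that did not see it in training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(GaussianNB(), X, y, cv=cv)

accuracy = (y_pred == y).mean()
print(f"accuracy: {accuracy:.3f}")
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred, target_names=data.target_names))
```

For a diagnostic task, recall on the malignant class usually deserves more attention than overall accuracy, since a missed malignant tumor is the costlier error.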


  1. d) Run the Decision Tree Classifier algorithm and compare the list produced in part (b) with the selected significant features produced by the Decision Tree model. [5 marks]
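The comparison in d) could be sketched as follows. The filter method used here (`SelectKBest` with the ANOVA F-score) is only a stand-in for whatever method was chosen in part (b), `k = 5` is arbitrary, and the bundled scikit-learn dataset is assumed to match the assignment's file.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled dataset stands in for the assignment file.
data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

k = 5  # arbitrary list length for the comparison

# Filter-style selection (stand-in for the method from part b).
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
filter_features = {str(name) for name in names[selector.get_support()]}

# Features the fitted tree relied on most, by impurity-based importance.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
top = tree.feature_importances_.argsort()[::-1][:k]
tree_features = {str(name) for name in names[top]}

print("in both lists:   ", sorted(filter_features & tree_features))
print("filter only:     ", sorted(filter_features - tree_features))
print("tree only:       ", sorted(tree_features - filter_features))
```

The two lists need not agree: a filter scores each feature on its own, whereas a tree picks features conditionally on earlier splits, so it may skip a feature that is redundant with one already used.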


  1. e) Identify similarities and differences between the two classification results. Discuss any differences. [5 marks]