This assignment gives you an opportunity to solve two real-world data mining problems using the machine learning workbench. In the two questions given below, justification of your answers carries a high proportion of the marks awarded. You are required to conduct experiments for both case studies and report them according to the specified requirements.
Study Area I (Boat Dataset)
This case study concerns a decision-making process that helps a customer evaluate the condition of a boat. The dataset contains 6 attributes, and each record is assigned one of 4 class labels describing the boat's condition. You can find more information about the dataset and its attributes in the metadata file under Assignment 1 materials.
You are required to build a model using the Decision Tree Classifier and answer the following questions based on the model built. Use the data segment containing the records whose outcomes are known. In building the model, use the 10-fold cross-validation option for testing.
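As a starting point, the 10-fold cross-validation setup described above could be sketched as follows. This is a minimal illustration, not a model answer: it assumes scikit-learn and pandas, and because the boat dataset is not reproduced here it generates a hypothetical stand-in with 6 categorical attributes and 4 class labels. The column names (`attr_1` to `attr_6`), the category values and the class names are all invented.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 6 categorical attributes and a 4-level class label.
# Replace this with the real boat dataset; every name below is invented.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {f"attr_{i}": rng.choice(["low", "med", "high"], size=n) for i in range(1, 7)}
)
# Let the label depend on two attributes so the tree has a pattern to learn.
score = (X["attr_1"] == "high").astype(int) + 2 * (X["attr_2"] == "high").astype(int)
y = score.map({0: "poor", 1: "fair", 2: "good", 3: "excellent"})

# Encode the categories as integers inside the pipeline, so the encoding
# is re-fitted on the training part of every fold.
model = make_pipeline(OrdinalEncoder(), DecisionTreeClassifier(random_state=0))

# Stratified 10-fold cross-validation keeps class proportions similar per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Ordinal encoding is only one possible choice for categorical attributes. Whichever encoding you use, you should justify it in your answer to part a).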
Your answers below need to be supported by suitable evidence, wherever appropriate. Some examples of suitable evidence are the Confusion Matrices, Model Visualizations and Summary Statistics.
a) Describe the pre-processing you have performed to prepare your data. [5 marks]
b) Using an appropriate method, identify the most influential features in classifying this dataset. Explain the process of the chosen feature selection method and the number of selected features for your dataset. [5 marks]
c) Perform initial data exploration by analysing the summary statistics of the selected variables. Identify the variance for each variable and describe their distribution using appropriate plot(s). [5 marks]
d) Now build a model using the Decision Tree algorithm. By adjusting two suitable parameters (one at a time), reduce the size of the tree to no more than 10 to 15 nodes in order to improve the interpretability of the generated model. Which of the two parameters yielded better accuracy while producing smaller trees?
Analyse your findings and discuss the results. Visualise the final generated decision tree and describe it. [10 marks]
e) Describe the role of the two parameters that you used in d) above in building the model. Do you expect that manipulating these parameters in the same way will improve accuracy for other types of datasets? Justify your answer. [5 marks]
f) Provide and carefully examine the Confusion Matrix. Is there any significant finding with regard to your selected variables? If so, why do you think this happens? [5 marks]
g) Generate and provide a classification report, showing precision, recall, F1 and support scores, to evaluate your model's performance. Describe your findings. [5 marks]
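The mechanics behind parts d), f) and g) could look roughly like the sketch below. It assumes scikit-learn and, because the real data is not shown here, a hypothetical stand-in of 6 ordinal-coded attributes and 4 classes. The two parameters, `max_depth` and `max_leaf_nodes`, are only one possible pair; other size-related parameters such as `min_samples_leaf` or `ccp_alpha` may suit your data better, and the values 3 and 8 are illustrative.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in: 6 ordinal-coded attributes (0, 1, 2) and 4 classes (0..3).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 6))
y = (X[:, 0] == 2).astype(int) + 2 * (X[:, 1] == 2).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Two size-limiting parameters, varied one at a time (values are illustrative).
# A binary tree of depth 3 has at most 15 nodes; 8 leaves also give at most 15.
candidates = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "max_leaf_nodes=8": DecisionTreeClassifier(max_leaf_nodes=8, random_state=0),
}

results = {}
for name, tree in candidates.items():
    # Out-of-fold predictions: each record is predicted by a model
    # that did not see it during training.
    y_pred = cross_val_predict(tree, X, y, cv=cv)
    # Node count of the tree fitted on all data, used for the size comparison.
    n_nodes = tree.fit(X, y).tree_.node_count
    results[name] = (n_nodes, (y_pred == y).mean())
    print(f"{name}: {n_nodes} nodes, CV accuracy {results[name][1]:.3f}")
    print(confusion_matrix(y, y_pred))
    print(classification_report(y, y_pred, zero_division=0))
```

The fitted tree can then be drawn with `sklearn.tree.plot_tree` for the visualisation asked for in part d). The numbers printed here come from synthetic data and say nothing about the boat dataset.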
Study Area II (Breast Cancer Diagnostic Dataset)
Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area. The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). You can find more information about this dataset and its attributes in the metadata file under Assignment 1 materials.
For this dataset, you will use both the Decision Tree classifier and the Naïve Bayes (NB) algorithm to build predictive models for these tumors. For both methods, use the 10-fold cross-validation option for testing.
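A minimal sketch of evaluating both classifiers under the same 10-fold cross-validation is given below. It assumes scikit-learn and uses the library's bundled Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in; whether that matches the file under Assignment 1 materials is an assumption you should check against the metadata.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled data stands in for the course file.
# In this copy the target is coded 0 = malignant, 1 = benign, so the
# precision/recall/F1 below are computed for the benign class (label 1).
X, y = load_breast_cancer(return_X_y=True)

# Use the same folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1"]
models = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

summary = {}
for name, model in models.items():
    out = cross_validate(model, X, y, cv=cv, scoring=metrics)
    # Average each metric over the 10 test folds.
    summary[name] = {m: out[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in summary[name].items()})
```

If malignant is the class of interest in your report, recode the target or pass a scorer with the appropriate positive label, and say which you did.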
a) Perform Exploratory Data Analysis (EDA) and describe your dataset. Explain any pre-processing and data manipulation tasks you performed to prepare your dataset for modelling. Present and discuss your findings in detail using both tabular and graphical formats. Note: no marks will be awarded for plots or tables presented without explanation. [10 marks]
b) Use an appropriate method of feature selection to identify significant features.
State the method used, list the features produced, and explain why this feature reduction method was used. Discuss the independence assumption between the features in the Naïve Bayes (NB) algorithm and support your answer with reference to the selected features. [15 marks]
c) Run the Naïve Bayes algorithm with the GaussianNB implementation for the selected features. Provide the metrics to evaluate the performance of the NB model and discuss the results. [15 marks]
d) Run the Decision Tree Classifier algorithm and compare the list produced in part (b) with the selected significant features produced by the Decision Tree model. [5 marks]
e) Identify similarities and differences between the two classification results. Discuss any differences. [5 marks]
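For parts b) and d), one way to obtain a feature list, probe the Naïve Bayes independence assumption, and compare against the Decision Tree is sketched below. Everything here is an illustrative assumption rather than the required method: the univariate ANOVA F-test (`SelectKBest` with `f_classif`), the choice of `k = 5`, and the use of scikit-learn's bundled `load_breast_cancer` data as a stand-in for the course file.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled data stands in for the course file.
data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

k = 5  # illustrative number of features to keep, not a recommendation

# Univariate selection: rank features by ANOVA F-score against the class.
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
mask = selector.get_support()
selected = list(names[mask])
print("SelectKBest:", selected)

# Naive Bayes assumes features are conditionally independent given the class.
# Within-class correlation among the selected features is a rough check:
# values near 1 suggest the assumption is violated for that pair.
X_sel = X[:, mask]
for label in np.unique(y):
    corr = np.corrcoef(X_sel[y == label], rowvar=False)
    off_diag = np.abs(corr[~np.eye(k, dtype=bool)])
    print(f"class {label}: max |corr| among selected features = {off_diag.max():.2f}")

# Features the Decision Tree itself relies on, ranked by impurity-based importance.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
top_tree = list(names[np.argsort(tree.feature_importances_)[::-1][:k]])
overlap = sorted(set(selected) & set(top_tree))
print("Decision Tree top features:", top_tree)
print("In both lists:", overlap)
```

Selecting features on the full dataset before cross-validation can leak information into the test folds. If you select features this way, acknowledge it, or wrap the selector and classifier in a `Pipeline` so that selection is repeated inside each fold.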