AIMS
This assignment gives you an opportunity to solve two real-world data mining
problems using the machine learning workbench. In the two questions given below,
justification of your answers carries a high proportion of the marks awarded. You
are required to conduct experiments for both case studies and report them according
to the specified requirements.
Study Area I (Boat Dataset)
This case study is concerned with the decision-making process of helping a customer
evaluate the condition of a boat. The dataset contains 6 attributes, and each record
is assigned one of 4 class labels describing the boat's condition. You can find more
information about the dataset and its attributes in the metadata file under
Assignment 1 materials.
You are required to build a model using the Decision Tree Classifier and answer the
following questions based on the model built. Use the data segment on the
subscriptions whose outcomes are known. In building the model, use the 10-fold
cross validation option for testing.
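As a reminder of what the 10-fold option does, the sketch below partitions sample
indices into 10 folds and uses each fold exactly once for testing. It is an
illustrative, library-free sketch only: `k_fold_indices` and its `seed` parameter are
hypothetical names, not part of the workbench, and no stratification by class is applied.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper for illustration; not a workbench function.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 25 samples is an arbitrary size chosen to show uneven folds.
splits = list(k_fold_indices(n_samples=25, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

If scikit-learn is the workbench in use, `cross_val_score(model, X, y, cv=10)` runs a
comparable procedure and stratifies the folds by class when `model` is a classifier.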
Your answers below need to be supported by suitable evidence, wherever
appropriate. Some examples of suitable evidence are the Confusion Matrices, Model
Visualizations and Summary Statistics.
- a) Describe the pre-processing you have performed to prepare your data. [5 marks]
- b) Using an appropriate method, identify the most influential features in classifying
this dataset. Explain the process of the chosen feature selection method and the
number of selected features for your dataset. [5 marks]
- c) Perform initial data exploration by analysing the summary statistics of the selected
variables. Identify the variance for each variable and describe their distribution
using appropriate plot(s). [5 marks]
- d) Now build a model using the Decision Tree algorithm. By adjusting two suitable
parameters (one at a time), reduce the size of the tree to no more than 10 to 15
nodes in order to improve the interpretability of the model generated. Which of
the two parameters yielded better accuracy while producing smaller trees?
Analyse your findings and discuss the results. Visualise the final generated
decision tree and describe it. [10 marks]
- e) Describe the role in model building of the two parameters that you used in d)
above. Do you expect that manipulating these parameters in the same way will
improve accuracy for other types of datasets? Justify your answer. [5 marks]
- f) Provide and carefully examine the Confusion Matrix. Is there any significant
finding with regard to your selected variables? If yes, why do you think this
happens? [5 marks]
- g) Generate and provide a classification report, showing precision, recall, F1 and
support scores, to evaluate your model performance. Describe your findings. [5 marks]
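For parts f) and g), it helps to see how the figures in a classification report follow
from the confusion matrix. The following is a minimal sketch in plain Python, assuming
rows hold actual classes and columns hold predicted classes. The function names and the
three class labels are made up for illustration and are not taken from the Boat Dataset.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels] for actual in labels]


def per_class_report(matrix, labels):
    """Return {label: (precision, recall, f1, support)} from a confusion matrix."""
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        support = sum(matrix[i])                         # row sum
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support)
    return report


# Invented labels and predictions, purely for illustration.
labels = ["poor", "fair", "good"]
y_true = ["poor", "poor", "fair", "fair", "fair", "good", "good", "good"]
y_pred = ["poor", "fair", "fair", "fair", "good", "good", "good", "poor"]

matrix = confusion_matrix(y_true, y_pred, labels)
report = per_class_report(matrix, labels)
for label in labels:
    precision, recall, f1, support = report[label]
    print(f"{label:>5}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}")
```

Off-diagonal cells show which classes are confused with each other, which is the
starting point for the discussion asked for in part f).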
Study Area II (Breast Cancer Diagnostic Dataset)
Breast cancer is the most common cancer amongst women in the world. It accounts
for 25% of all cancer cases, and affected over 2.1 million people in 2015 alone. It
starts when cells in the breast begin to grow out of control. These cells usually form
tumors that can be seen via X-ray or felt as lumps in the breast area. The key
challenge in its detection is how to classify tumors as malignant (cancerous)
or benign (non-cancerous). You can find more information about this dataset and its
attributes in the metadata file under Assignment 1 materials.
For this dataset, you will use both the Decision Tree classifier and the Naïve Bayes
(NB) algorithm to build predictive models for these tumors. For both methods, use the
10-fold cross validation option for testing.
- a) Perform Exploratory Data Analysis (EDA) and describe your dataset. Explain
any pre-processing and data manipulation tasks you performed to prepare your
dataset for feature modelling. Present and discuss in detail your findings using
both tabular and graphical formats. Note: no marks will be given for plots or
tables presented without explanation. [10 marks]
- b) Use an appropriate method of feature selection to identify significant features.
State the method used, list the features produced, and explain why this feature
reduction method was used. Discuss the independence assumption between
features in the Naïve Bayes (NB) algorithm and support your answer with reference
to the selected features. [15 marks]
- c) Run the Naïve Bayes algorithm with the GaussianNB implementation for the
selected features. Provide the metrics to evaluate the performance of the NB
model and discuss the results. [15 marks]
- d) Run the Decision Tree Classifier algorithm and compare the list produced in part
(b) with the selected significant features produced by the Decision Tree model. [5 marks]
- e) Identify similarities and differences between the two classification results. Discuss
any differences. [5 marks]
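The independence assumption discussed in part b) can be made concrete with a small
from-scratch Gaussian Naïve Bayes. The sketch below is not the GaussianNB implementation
the brief asks you to run: all function names are hypothetical, the data is a toy
two-feature example invented for illustration, and the small variance constant is an
assumption added only to avoid division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant keeps every variance strictly positive (an assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_posterior(model, row):
    """Unnormalised log P(class | row) for every class."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: each feature contributes its own Gaussian
        # log-likelihood, and the contributions are simply added together.
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, row):
    """Return the class with the highest unnormalised log posterior."""
    scores = log_posterior(model, row)
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))
```

Because the per-feature terms are summed without any interaction term, two strongly
correlated features contribute the same evidence twice. This is why the correlation
among your selected features is relevant when discussing the NB results.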