Put all your work into a file titled BUSA8001_programming_task2_MQ_ID.ipynb, where MQ_ID is your Macquarie University student ID number (e.g. if MQ_ID == 12345678 then you need to submit BUSA8001_programming_task2_12345678.ipynb).
• Failure to submit a correctly named file will result in a loss of 30 points.
• Failure to supply solutions in the cells provided below each question will result in a loss of 30 points.
• Follow all instructions closely and do not print your variables to screen unless explicitly asked to do so. Failure to follow these instructions will result in additional point deductions.
Problem 1 – (30 points)
Perform the following tasks in Python, writing your code in the cells provided underneath each question.
Q1. Import the credit card data from https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default of credit card clients.xls directly into a pandas DataFrame named `df`, making sure you skip the top row when reading the dataset. Delete the ‘ID’ column after importing the data. (5 points)
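The import step in Q1 could be sketched roughly as follows. This is only an illustration under stated assumptions: `load_credit_data` and `DATA_URL` are hypothetical names, the URL is the percent-encoded form of the one given above, and reading an `.xls` file with pandas assumes an Excel engine such as `xlrd` is installed. Here `skiprows=1` is one way to skip the top row; `header=1` would be an alternative.

```python
import pandas as pd

# Percent-encoded form of the URL given in Q1 (spaces become %20).
DATA_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
    "default%20of%20credit%20card%20clients.xls"
)


def load_credit_data(source=DATA_URL):
    """Read the spreadsheet, skip its top row, and drop the 'ID' column.

    `load_credit_data` is a hypothetical helper, not part of the task.
    """
    # skiprows=1 discards the first spreadsheet row, so the second row
    # supplies the column names.
    frame = pd.read_excel(source, skiprows=1)
    return frame.drop(columns="ID")


# Usage (needs network access and an .xls engine such as xlrd):
# df = load_credit_data()
```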
Q2. Rename the column ‘PAY_0’ to ‘PAY_1’ and the column ‘default payment next month’ to ‘payment_default’. (5 points)
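The renaming in Q2 can be illustrated on a toy frame. The two-row DataFrame below is made up for demonstration and only carries the two column names mentioned in the question.

```python
import pandas as pd

# Toy frame with the two original column names (values are made up).
df = pd.DataFrame({"PAY_0": [0, 2], "default payment next month": [1, 0]})

# rename returns a new DataFrame; columns not listed are left unchanged.
df = df.rename(
    columns={
        "PAY_0": "PAY_1",
        "default payment next month": "payment_default",
    }
)
```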
Q3. Create a one-dimensional NumPy array named `y` by exporting the first 12,500 observations of the ‘payment_default’ column from `df` (hint: see the `ravel` NumPy method). Similarly, create a two-dimensional NumPy array named `X` by exporting the first 12,500 observations of the ‘PAY_1’, ‘PAY_2’, ‘AGE’, ‘SEX’, ‘MARRIAGE’, ‘EDUCATION’ and ‘BILL_AMT1’ columns. (10 points)
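One possible way to export the arrays described in Q3 is sketched below on synthetic data. The 20-row random frame, the `feature_cols` list and `n_obs = 10` are assumptions made for illustration; the task itself uses the real dataset and the first 12,500 rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature_cols = ["PAY_1", "PAY_2", "AGE", "SEX", "MARRIAGE", "EDUCATION", "BILL_AMT1"]

# Synthetic stand-in for the real data (random integers, 20 rows).
df = pd.DataFrame(
    rng.integers(0, 5, size=(20, len(feature_cols))), columns=feature_cols
)
df["payment_default"] = rng.integers(0, 2, size=20)

n_obs = 10  # illustrative only; the task asks for the first 12,500 rows

# Selecting with a list gives a 2-D array of shape (n_obs, 1);
# ravel flattens it to one dimension.
y = df[["payment_default"]].iloc[:n_obs].to_numpy().ravel()

# Seven selected columns give a 2-D array of shape (n_obs, 7).
X = df[feature_cols].iloc[:n_obs].to_numpy()
```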
Q4. Use an appropriate `scikit-learn` library we learned in class to create the following NumPy arrays: `y_train`, `y_test`, `X_train` and `X_test` by splitting the data into 68% train and 32% test datasets. Set `random_state` to 3 and stratify subsamples so that train and test datasets have roughly equal proportions of the target’s class labels. (5 points)
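A stratified 68/32 split of the kind described in Q4 might look like the sketch below. It assumes `train_test_split` from `sklearn.model_selection` is the intended tool, and it uses synthetic `X` and `y` (100 rows, 25% positive labels) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 observations, 7 features, 25% positive labels.
X = rng.normal(size=(100, 7))
y = np.array([0] * 75 + [1] * 25)

# test_size=0.32 leaves 68% for training; stratify=y keeps the class
# proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.32, random_state=3, stratify=y
)
```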
Q5. Use an appropriate `scikit-learn` library we learned in class to standardize features from train and test datasets to mean zero and variance one, as discussed in class. (5 points)
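The standardization in Q5 could be sketched as follows, assuming `StandardScaler` is the intended class. The synthetic arrays and the names `sc`, `X_train_std` and `X_test_std` are illustrative assumptions. The scaler is fitted on the training data only, so the test data are transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the train and test feature arrays.
X_train = rng.normal(loc=50.0, scale=10.0, size=(68, 7))
X_test = rng.normal(loc=50.0, scale=10.0, size=(32, 7))

sc = StandardScaler()
sc.fit(X_train)  # estimate per-feature mean and std from training data only

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)  # reuses the training statistics
```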
Problem 2 – (30 points)
Q6. Use appropriate `scikit-learn` libraries we learned in class to fit the following classifiers to the training dataset constructed in Problem 1:
- Logistic Regression – name your instance `lr` and set `random_state=11`
- Support Vector Machine with Linear Kernel – name your instance `svm_linear` and set `C=6.0` and `random_state=11`
- Support Vector Machine with RBF Kernel – name your instance `svm_rbf` and set `gamma=21`, `C=5.6`, `random_state=11`
- Decision Tree – name your instance `tree` and set `criterion='entropy'`, `max_depth=4`, `random_state=11`
- Random Forest – name your instance `forest` and set `criterion='entropy'`, `n_estimators=21`, `random_state=11`
- KNN – name your instance `knn` and set `n_neighbors=6`, `p=3`, `metric='minkowski'`
When initializing instances of the above classifiers, only set the parameters provided above and leave all other parameters equal to their `scikit-learn` default values. (30 points)
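The six classifiers listed in Q6 could be instantiated along these lines. This is a sketch under assumptions: `SVC` with `kernel="linear"` and `kernel="rbf"` is taken to be the intended support vector machine class, the `classifiers` dictionary is a hypothetical convenience, and `X_demo`, `y_demo` are synthetic data generated by `make_classification` instead of the arrays from Problem 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the standardized training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Only the parameters named in the task are set; the rest stay at defaults.
lr = LogisticRegression(random_state=11)
svm_linear = SVC(kernel="linear", C=6.0, random_state=11)
svm_rbf = SVC(kernel="rbf", gamma=21, C=5.6, random_state=11)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=11)
forest = RandomForestClassifier(criterion="entropy", n_estimators=21, random_state=11)
knn = KNeighborsClassifier(n_neighbors=6, p=3, metric="minkowski")

# Hypothetical helper mapping, used here only to fit everything in one loop.
classifiers = {
    "lr": lr,
    "svm_linear": svm_linear,
    "svm_rbf": svm_rbf,
    "tree": tree,
    "forest": forest,
    "knn": knn,
}
for clf in classifiers.values():
    clf.fit(X_demo, y_demo)
```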
Problem 3 – (40 points)
Q7. Using a method built into each of the above classifiers, compute prediction accuracy on training data for each classifier and store it into variables named according to the following pattern: `classifier_name_accuracy_train`; for instance, you should have `lr_accuracy_train`. (10 points)
Q8. Using a method built into each of the above classifiers, compute prediction accuracy on test data for each classifier and store it into variables named according to the following pattern: `classifier_name_accuracy_test`; for instance, you should have `lr_accuracy_test`. (10 points)
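The accuracy computations in Q7 and Q8 can be illustrated for a single classifier. The sketch assumes the built-in method in question is `score`, which returns mean accuracy for `scikit-learn` classifiers, and it uses synthetic data from `make_classification` in place of the arrays from Problem 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split in the same proportions as in Q4.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.32, random_state=3, stratify=y
)

lr = LogisticRegression(random_state=11).fit(X_train, y_train)

# score returns the fraction of correctly classified observations.
lr_accuracy_train = lr.score(X_train, y_train)
lr_accuracy_test = lr.score(X_test, y_test)
```

The same two calls would apply to the other fitted classifiers, with the variable prefix changed to match each instance name.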
Q9. Explain which methods rank in the first two places according to their ability to accurately classify train data, and which two methods perform worst on the train dataset. (10 points)
- Explain which methods rank in the first two places according to their ability to accurately classify test data, and which two methods perform worst on the test dataset. (3 marks)
- How do these accuracies compare with the ones reported in Q9? Is this expected, and why (or why not)? (7 marks)