BUSA8001 – Programming Task 3



Problem 1 – Total Marks: 7.5

Q1. Read the credit card dataset from Programming Task 2 into a DataFrame named df and rename the columns ‘PAY_0’ and ‘default payment next month’ as in Programming Task 2

Delete ‘ID’ column

Print columns of df

Print shape of df

(2.5 marks)
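
A minimal sketch of the Q1 steps, run on a tiny stand-in DataFrame rather than the real file from Programming Task 2. The new column names are assumptions: ‘payment_default’ is the name used in Problem 3, while ‘PAY_1’ is only a guess at what Programming Task 2 used.

```python
import pandas as pd

# Tiny stand-in for the credit card data; the real file comes from Programming Task 2.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000, 120000, 90000],
    "PAY_0": [2, -1, 0],
    "default payment next month": [1, 1, 0],
})

# Assumed new names: PAY_1 is hypothetical, payment_default matches Problem 3.
df = df.rename(columns={
    "PAY_0": "PAY_1",
    "default payment next month": "payment_default",
})

# Drop the identifier column, which carries no predictive information.
df = df.drop(columns="ID")

print(df.columns)
print(df.shape)
```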

Q2. Feature Engineering – Create additional features and add them to df by squaring the following variables


All BILL_AMT variables

All PAY_AMT variables

Name the new variables by appending _2 to the existing variables that you transformed, e.g. LIMIT_BAL_2

(5 marks)
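
One possible way to build the squared features is to select columns by name prefix and loop over them. The sketch below uses made-up values. Including LIMIT_BAL is an assumption based only on the LIMIT_BAL_2 naming example above.

```python
import pandas as pd

# Made-up rows with a subset of the real column names.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000],
    "BILL_AMT1": [3913, 2682],
    "BILL_AMT2": [3102, 1725],
    "PAY_AMT1": [0, 1000],
    "PAY_1": [2, -1],
})

# Columns to square. LIMIT_BAL is included as an assumption (see the naming example);
# PAY_1 is excluded because it starts with "PAY_" but not "PAY_AMT".
to_square = [
    c for c in df.columns
    if c == "LIMIT_BAL" or c.startswith(("BILL_AMT", "PAY_AMT"))
]

# Append "_2" to the original name for each squared feature.
for col in to_square:
    df[col + "_2"] = df[col] ** 2

print(df.columns.tolist())
```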

Problem 2. Cleaning data and dealing with categorical features – Total Marks: 22.5


Q1. Print value_counts() of ‘SEX’ column and add dummy variables ‘SEX_MALE’ and ‘SEX_FEMALE’ to df using get_dummies().

Make sure that the original SEX variable is removed from df. (2.5 marks)

Carefully explain how the new variables are constructed. (1.5 marks)
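
A hedged sketch of one way to obtain the two named dummies: map the numeric codes to labels first, so that get_dummies() produces the required column names. The coding 1 = male, 2 = female follows the public UCI description of this dataset and should be checked against your own file. The AGE column and the values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"SEX": [1, 2, 2, 1, 2], "AGE": [24, 26, 34, 37, 57]})
print(df["SEX"].value_counts())

# Assumed coding (UCI data description): 1 = male, 2 = female.
sex_labels = df["SEX"].map({1: "MALE", 2: "FEMALE"})

# Each dummy equals 1 when the row belongs to that category and 0 otherwise.
dummies = pd.get_dummies(sex_labels, prefix="SEX", dtype=int)

# Replace the original column with the two indicator columns.
df = pd.concat([df.drop(columns="SEX"), dummies], axis=1)
print(df.head())
```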

Q2. Print value_counts() of ‘MARRIAGE’ column, provide its definition, and carefully comment on what you notice in relation to the definition of this variable. (2.5 marks)

Q3. Use get_dummies() on ‘MARRIAGE’ and add dummy variables ‘MARRIAGE_MARRIED’, ‘MARRIAGE_SINGLE’, ‘MARRIAGE_OTHER’ to df. Allocate all values of ‘MARRIAGE’ across the 3 newly created features appropriately. Make sure that the original ‘MARRIAGE’ variable is removed from df. (5 marks)

Explain how you created the new features and what decisions you had to make. (3.5 marks)
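
The allocation step might be sketched as follows. The coding 1 = married, 2 = single, 3 = others is taken from the public UCI description. Folding an undocumented code such as 0 into OTHER is one possible decision, not the required one, and should be justified in your written answer.

```python
import pandas as pd

df = pd.DataFrame({"MARRIAGE": [1, 2, 3, 0, 2]})
print(df["MARRIAGE"].value_counts())

# Assumed mapping; sending the undocumented code 0 to OTHER is a modelling choice.
labels = {1: "MARRIED", 2: "SINGLE", 3: "OTHER", 0: "OTHER"}
marriage = df["MARRIAGE"].map(labels)

dummies = pd.get_dummies(marriage, prefix="MARRIAGE", dtype=int)
df = pd.concat([df.drop(columns="MARRIAGE"), dummies], axis=1)
print(df)
```

Because every code is mapped to exactly one label, each row has exactly one dummy equal to 1.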

Q4. In the column ‘EDUCATION’, convert values {0, 4, 5, 6} into the value 4. (7.5 marks)
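
As a small illustration on made-up values, Series.replace() with a dictionary recodes the listed values and leaves all others unchanged:

```python
import pandas as pd

df = pd.DataFrame({"EDUCATION": [0, 1, 2, 3, 4, 5, 6]})

# Collapse 0, 5 and 6 into 4; values already equal to 4 stay as they are.
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})

print(df["EDUCATION"].value_counts())
```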

Problem 3. Preparing X and y arrays – Total Marks: 5

Q1. Create y from 12,500 consecutive observations of the ‘payment_default’ column of df, starting from observation 1,000 (i.e. observation 1,000 is the starting point). Similarly, create X using the 12,500 corresponding observations of all the remaining features in df. (2.5 marks)
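
A sketch of the slicing on random stand-in data. It assumes that "observation 1,000" means positional index 1000 (zero-based); if the task intends one-based counting, the start would be 999 instead.

```python
import numpy as np
import pandas as pd

# Random stand-in data with the target column name used in this task.
rng = np.random.default_rng(0)
n = 15000
df = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10000, 500000, n),
    "AGE": rng.integers(21, 70, n),
    "payment_default": rng.integers(0, 2, n),
})

# Assumption: "observation 1,000" is positional index 1000.
start, n_obs = 1000, 12500
subset = df.iloc[start:start + n_obs]

y = subset["payment_default"].values
X = subset.drop(columns="payment_default").values

print(X.shape, y.shape)
```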

Q2. Use an appropriate scikit-learn function we learned in class to create y_train, y_test, X_train and X_test by splitting the data into 70% train and 30% test datasets.

Set random_state to 2 and stratify subsamples so that train and test datasets have roughly equal proportions of the target’s class labels.

(2.5 marks)
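
One function that fits this description is train_test_split from sklearn.model_selection; whether it is the one covered in class is an assumption. The sketch below uses random stand-in arrays in place of the X and y built in Q1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in arrays; roughly 22% of labels are 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.22).astype(int)

# stratify=y keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```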

Problem 4. Optimize hyperparameters using grid search and SVC – Total Marks: 40

Q1. Use make_pipeline to create a pipeline called pipe_svc consisting of:

  • StandardScaler
  • PCA (set random_state to 1)
  • SVC (set random_state to 1)

(10 marks)
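
A minimal sketch of the pipeline construction. make_pipeline names each step after the lowercased class name, which matters later when addressing hyperparameters in a grid search.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps run in order: scale, reduce dimension, classify.
pipe_svc = make_pipeline(
    StandardScaler(),
    PCA(random_state=1),
    SVC(random_state=1),
)

# Auto-generated step names are used as prefixes, e.g. "svc__kernel".
print([name for name, _ in pipe_svc.steps])
```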

Q2. Use GridSearchCV to create a gs object, fit the model and tune the following hyperparameters:

  • SVC parameter – grid search over the following values [0.1, 1, 10]
  • SVC kernel – grid search over 3 alternatives: linear, sigmoid, and rbf
  • Number of PCA components – grid search over the following 3 values [1, 4, 9]
  • When implementing GridSearchCV set the following options (leaving everything else at its default value):
      • accuracy for scoring
      • refit to True
      • number of cross-validation folds to 10


(20 marks)
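
The grid search could be set up along the following lines. This is a sketch on synthetic data from make_classification, not the credit card data. The task text does not name the SVC parameter to search over; the sketch assumes it is the regularisation parameter C.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with enough features for 9 PCA components.
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

pipe_svc = make_pipeline(StandardScaler(), PCA(random_state=1), SVC(random_state=1))

# Keys follow the "<step name>__<parameter>" convention of pipelines.
param_grid = {
    "svc__C": [0.1, 1, 10],  # assumption: the unnamed SVC parameter is C
    "svc__kernel": ["linear", "sigmoid", "rbf"],
    "pca__n_components": [1, 4, 9],
}

gs = GridSearchCV(pipe_svc, param_grid, scoring="accuracy", refit=True, cv=10)
gs.fit(X_train, y_train)

print(gs.best_params_)
```

With 3 x 3 x 3 = 27 parameter combinations and 10 folds, the search fits 270 models plus one final refit, so on the full 12,500 observations it may take noticeably longer than on this small example.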

Q3. Using the best model optimised by grid-search print the following

cross-validation best_score_

accuracy for the training set

accuracy for the test set

(10 marks)
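
The three quantities can be read off a fitted GridSearchCV object as sketched below. To keep the example quick it uses synthetic data and a deliberately reduced grid (with C assumed as the SVC parameter), so the printed numbers say nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

pipe_svc = make_pipeline(StandardScaler(), PCA(random_state=1), SVC(random_state=1))

# Reduced grid for illustration only; the task asks for a larger one.
small_grid = {"svc__C": [0.1, 1], "pca__n_components": [4, 9]}
gs = GridSearchCV(pipe_svc, small_grid, scoring="accuracy", refit=True, cv=10)
gs.fit(X_train, y_train)

# With refit=True, best_estimator_ is the best pipeline refit on all training data.
best_model = gs.best_estimator_
train_acc = best_model.score(X_train, y_train)
test_acc = best_model.score(X_test, y_test)

print("CV best_score_:", gs.best_score_)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```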

Problem 5. Confusion Matrix – Total Marks: 25

Q1. Use the best fitted model of gs to print the confusion matrix. (5 marks)

Q2. Plot the confusion matrix, and on its basis compute the True Positive Rate, False Positive Rate and Precision. (10 marks)
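
To make the three quantities concrete, the sketch below computes them from a small hand-made set of labels and predictions rather than from the fitted model. In scikit-learn's confusion_matrix the rows are true classes and the columns are predicted classes, so for binary labels ravel() returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 1 = default, 0 = no default.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

confmat = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = confmat.ravel()

tpr = tp / (tp + fn)        # True Positive Rate (recall): share of defaulters caught
fpr = fp / (fp + tn)        # False Positive Rate: share of non-defaulters flagged
precision = tp / (tp + fp)  # share of predicted defaulters who actually default

print(confmat)
print(tpr, fpr, precision)
```

In this toy example the model misses three of the four defaulters (TPR = 0.25) while still reaching 60% accuracy, which is the kind of pattern Q3 asks you to look for. For the plot, recent scikit-learn versions provide ConfusionMatrixDisplay.from_predictions, which requires matplotlib.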

Q3. Looking at the confusion matrix values and the three quantities that you computed, what is the greatest source of risk to the credit card company should it rely on the predictions constructed by our model optimised for accuracy?

Explain your answer in detail. (10 marks)

Provide answer here