Please upload the resulting jupyter notebook as pdf with the answers to the questions inline. You can do this by “printing” the jupyter notebook and then selecting “save as pdf”. Make sure to include your NetID and real name in your submission.
a. Use the file “assignment_1.csv” from the first assignment for this task. The target column is “gold_price”.
b. Plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the target time series. (see https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html)
c. Describe what the plots indicate (in terms of autocorrelation and autoregressive parameter (p) and moving average (q)).
Some rules of thumb to recall:
i. Rule 1: If the ACF shows exponential decay, the PACF has a spike at lag 1, and no correlation for other lags, then use one autoregressive (p) parameter
ii. Rule 2: If the ACF shows a sine-wave shape pattern or a set of exponential decays, the PACF has spikes at lags 1 and 2, and no correlation for other lags, then use two autoregressive (p) parameters.
iii. Rule 3: If the ACF has a spike at lag 1, no correlation for other lags, and the PACF damps out exponentially, then use one moving average (q) parameter.
iv. Rule 4: If the ACF has spikes at lags 1 and 2, no correlation for other lags, and the PACF has a sine-wave shape pattern or a set of exponential decays, then use two moving average (q) parameters.
v. Rule 5: If the ACF shows exponential decay starting at lag 1, and the PACF shows exponential decay starting at lag 1, then use one autoregressive (p) and one moving average (q) parameter.
d. Determine how many times you need to difference the data (the order of differencing) and perform the analysis on the n-times differenced data.
e. Another approach to assessing the presence of autocorrelation is the Durbin-Watson (DW) statistic. The value of the DW statistic is close to 2 if the errors are uncorrelated. What is DW for our data, and does this match what you observed from the ACF and PACF plots? (see https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html – in fact the statsmodels package provides a lot of the functionality needed for time series analysis)
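A quick illustration of the DW statistic's behavior on synthetic data: white noise gives a value near 2, while a strongly positively autocorrelated series (a random walk) pushes it toward 0. Values above 2 would indicate negative autocorrelation.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
white = rng.normal(size=300)                 # uncorrelated -> DW near 2
positive = np.cumsum(rng.normal(size=300))   # strong positive autocorrelation -> DW near 0

print(round(durbin_watson(white), 2), round(durbin_watson(positive), 2))
```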
f. Remove serial dependency by modeling a simple ARMA process with the p and q derived above. Take a look at what the resulting process looks like (plot it).
g. Calculate the residuals, and test the null hypothesis that the residuals come from a normal distribution, and construct a qq-plot. Do the results of the hypothesis test and qq-plot align?
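For a fitted statsmodels model the residuals are available as `res.resid`; here a normal sample stands in for them. `scipy.stats.normaltest` tests the null hypothesis of normality, and `statsmodels.api.qqplot` draws the qq-plot (commented out so the sketch runs headless).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid = rng.normal(size=300)  # stand-in for the model residuals (res.resid)

# H0: the residuals come from a normal distribution.
stat, p = stats.normaltest(resid)
print("p-value:", round(p, 3), "- reject H0?" , p < 0.05)

# qq-plot (uncomment to display):
# import statsmodels.api as sm
# import matplotlib.pyplot as plt
# sm.qqplot(resid, line="45"); plt.show()
```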
h. Now investigate the autocorrelation of your ARMA(p,q) model. Did it improve? These can be examined graphically, but a statistic will help. Next, we calculate the lag, autocorrelation (AC), Q statistic and Prob>Q. The Ljung–Box Q test is a type of statistical test of whether any of a group of autocorrelations of a time series are different from zero. The null hypothesis is, H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).
i. Compute predictions for the years 2000 and after, as well as 2010 and after, and analyze their fit against the actual values.
j. Calculate the forecast error via MAE and MFE.
Reminders: Mean absolute error: The mean absolute error (MAE) is computed as the average of the absolute error values. If MAE is zero, the forecast is perfect. Compared to the mean squared error (MSE), this measure of fit “de-emphasizes” outliers (unique or rare large error values affect the MAE less than the MSE).
Mean Forecast Error (MFE, also known as bias): The MFE is the average forecast error (actual minus predicted). A large positive MFE means that the forecast is undershooting the actual observations. A large negative MFE means the forecast is overshooting the actual observations. A value near zero is ideal, and generally a small value means a pretty good fit.
The MAE is a better indicator of fit than the MFE.
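The two error measures above can be written directly from their definitions (using the actual-minus-predicted sign convention stated for the MFE):

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error: average of |actual - forecast|."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(forecast)))

def mfe(actual, forecast):
    """Mean forecast error (bias): average of (actual - forecast).
    Positive -> forecast undershoots; negative -> forecast overshoots."""
    return np.mean(np.asarray(actual) - np.asarray(forecast))

actual = [10, 12, 14]
forecast = [9, 13, 12]
# Errors are 1, -1, 2: MAE = (1 + 1 + 2) / 3 = 4/3, MFE = (1 - 1 + 2) / 3 = 2/3
print(mae(actual, forecast), mfe(actual, forecast))
```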
a. Load the file “assignment_3.csv”. This file contains doctor analyses of breast cancer images. The target is the “pathology” whether the tissue is “benign” (= no cancer) or “malignant” (= has cancer).
b. Explore and clean up the data. Additionally, convert categorical columns into numerical columns that can be understood by a machine learning model. The final data must have numerical columns only (except the target) and two target classes.
Operations that might be necessary:
i. One-hot-encoding (i.e., add column for each category)
ii. Binarization (i.e., representing one category as 0 and the other as 1)
iii. Merging categories (e.g., if categories are too similar or if categories only appear a few times)
iv. Removing columns without information
v. Converting categories to numbers (i.e., introducing an order)
vi. Converting numbers to categories (e.g., if values do not represent numbers)
c. Prepare a training set (75%) and test set (25%) by sampling the data without replacement and without overlaps.
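A 75/25 split without replacement and without overlap can be done with scikit-learn's `train_test_split` (or equivalently `DataFrame.sample`). The toy frame below stands in for the cleaned assignment data; the target column name follows the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame as a stand-in for the cleaned assignment_3.csv data.
df = pd.DataFrame({"x": range(100), "pathology": [0, 1] * 50})

# Sampling without replacement; train and test index sets are disjoint.
train, test = train_test_split(df, test_size=0.25, random_state=0)
print(len(train), len(test))
```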
d. If we had to, how would we prove to ourselves or a colleague that our data was indeed randomly sampled on X? And by proof, I mean empirically, not just showing this person our code. Don’t actually do the work, just describe in your own words a test you could run here. Hint: think about this in terms of selection bias.
e. Now build and train a decision tree classifier using DecisionTreeClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) on the training set to predict the target variable. Use criterion=’entropy’; for all other settings use the default options.
f. Using the resulting model show a bar plot of feature names and their feature importance (hint: check the documentation of DecisionTreeClassifier)
g. Compute the accuracy of your model on the test set.
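Steps e-g in one sketch. Since the assignment CSV isn't available here, scikit-learn's built-in breast-cancer dataset stands in for the cleaned data; `feature_importances_` is the attribute to bar-plot in step f, and `score` gives the test accuracy for step g.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Built-in breast-cancer data as a stand-in for the cleaned assignment_3.csv.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_tr, y_tr)

# One importance per feature, summing to 1 -- bar-plot these against the
# feature names (e.g. with matplotlib's plt.bar).
importances = clf.feature_importances_
print("test accuracy:", round(clf.score(X_te, y_te), 3))
```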
3. Hyper-Parameter Optimization
a. The default options for your decision tree may not be optimal. We need to analyze whether tuning the parameters “min_samples_split” and “min_samples_leaf” can improve the accuracy of the classifier. Generate a list of 10 values for each of these two parameters. Explain in words your reasoning for choosing those ranges.
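A sketch of generating candidate value lists. The specific ranges below are illustrative judgment calls, not prescribed by the assignment; the exponential grid anticipates the SVM C parameter in a later step.

```python
import numpy as np

# 10 candidate values each; the ranges themselves are a judgment call.
min_samples_split_grid = np.linspace(2, 200, 10, dtype=int).tolist()
min_samples_leaf_grid = np.linspace(1, 100, 10, dtype=int).tolist()

# Exponential grid, e.g. for the SVM C parameter: 10^-8 ... 10^1.
c_grid = np.logspace(-8, 1, 10).tolist()
print(min_samples_split_grid)
```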
b. In order to not overfit on our test data, create a 5-fold cross validation data set on the training data. (partition the training data into 5 equally sized folds)
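Scikit-learn's `KFold` partitions the training data into 5 equally sized, non-overlapping folds; a minimal sketch on dummy indices:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # stand-in for the training data

kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(X))  # each entry: (train indices, fold/test indices)
print(len(folds), [len(te) for _, te in folds])
```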
c. Loop through the different model parameters: for each model configuration and fold, train the model on the training data not in the fold and use the fold as test data. Compute the AUC as the test metric of a fold. To obtain a comparable metric for a model configuration, use “mean AUC – stderr AUC” (the mean of the AUCs over all folds minus their standard error). Use functions to separate the different parts of this process and make it reusable for different models / parameters / data.
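One way to package that step is a reusable scoring function that takes a model factory, so the same code serves different models and parameter values. The breast-cancer dataset again stands in for the assignment's training data, and the example parameter value is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cv_score(model_factory, X, y, n_splits=5, seed=0):
    """Comparable metric for one configuration: mean AUC minus the
    standard error of the fold AUCs."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in kf.split(X):
        model = model_factory()           # fresh model per fold
        model.fit(X[tr], y[tr])
        scores = model.predict_proba(X[te])[:, 1]
        aucs.append(roc_auc_score(y[te], scores))
    aucs = np.asarray(aucs)
    return aucs.mean() - aucs.std(ddof=1) / np.sqrt(len(aucs))

X, y = load_breast_cancer(return_X_y=True)  # stand-in training data
metric = cv_score(
    lambda: DecisionTreeClassifier(criterion="entropy",
                                   min_samples_leaf=5, random_state=0),
    X, y)
print(round(metric, 3))
```

Looping `cv_score` over the parameter grids from the previous step yields one comparable number per configuration, ready for the line chart in the next step.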
d. Plot the obtained comparable AUC metrics of the parameters as a line chart (x-axis: parameter; y-axis: comparable AUC metric). Can you find a clear optimum? (e.g., if the best metric is at the end of the value range expand the range into that direction until you are able to make out a peak)
e. Now use the above functions to find best models for LogisticRegression (no additional parameters), RandomForestClassifier (number of trees and depth), SVM (parameter C, choose exponential values, e.g., 10^(-8), …, 10^1), and Multi Layer Perceptron (depth of network, breadth of network).
f. Use the best model of each architecture and compute the AUC on the actual test data. (Note: no aggregation is needed here.) Which model was best on the training data? Does the overall best model also hold up on the test data? Plot the ROC curves of the models on the test data on top of each other.
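A sketch of the final comparison: fit each model on the training split, score it once on the held-out test data, and collect the ROC curve points for overlaid plotting. Two of the four architectures are shown with illustrative (not tuned) settings, again on the stand-in breast-cancer dataset; in the assignment, use the best configuration found for each architecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    aucs[name] = roc_auc_score(y_te, scores)
    # For the overlaid ROC plot: plt.plot(fpr, tpr, label=name)
    fpr, tpr, _ = roc_curve(y_te, scores)

print({k: round(v, 3) for k, v in aucs.items()})
```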