1. Bagging. (? points)
In this question, you will use neural nets and bagging to classify images of handwritten
digits from MNIST, a benchmark machine-learning dataset. You will use sklearn to
define, train and test neural nets, just as you did in Question 4 of Assignment 2.
To start, download and uncompress (if necessary) the MNIST data file from the course
web page. The file, called mnistTVT.pickle.zip, contains training, validation and
test data. Next, start the Python interpreter and read the file with the following
lines of code:
import pickle
with open('mnistTVT.pickle','rb') as f:
    Xtrain,Ttrain,Xval,Tval,Xtest,Ttest = pickle.load(f)
The variables Xtrain and Ttrain contain training data, while Xval and Tval contain
validation data, and Xtest and Ttest contain test data.
Xtrain is a Numpy array with 50,000 rows and 784 columns. Each row represents a
hand-written digit. Although each digit is stored as a row vector with 784 components,
it actually represents an array of pixels with 28 rows and 28 columns (784 = 28 × 28).
Each pixel is stored as a floating-point number, but has an integer value between 0 and
255 (i.e., the values representable in a single byte). The variable Ttrain is a vector of
50,000 image labels, where a label is an integer between 0 and 9. For example, if row n
of Xtrain is an image of the digit 7, then Ttrain[n] = 7. Likewise for the validation
and testing data, which contain 10,000 images each.
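As a quick sanity check, the shapes described above can be verified directly. A minimal sketch (the arrays here are zero-filled stand-ins with the same shapes as the real data, which comes from mnistTVT.pickle):

```python
import numpy as np

# Stand-ins with the same shapes as the real MNIST arrays
# (assumption: the real Xtrain/Ttrain come from mnistTVT.pickle).
Xtrain = np.zeros((50000, 784))
Ttrain = np.zeros(50000, dtype=int)

print(Xtrain.shape)   # (50000, 784): one 784-pixel row vector per digit
print(Ttrain.shape)   # (50000,): one integer label (0-9) per image
```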
To view a digit, you must first convert it to a 28 × 28 array using the function
numpy.reshape. To display a 2-dimensional array as an image, you can use the function
imshow in matplotlib.pyplot. To see an image in black-and-white, add the keyword
argument cmap='Greys' to imshow. To remove the smoothing and see the 784 pixels
clearly, add the keyword argument interpolation='nearest'. Try displaying a few
digits as images. (Figure 1 shows some examples.) For comparison, try printing them
as vectors. (Do not hand this in.)
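The reshape-and-display steps above can be sketched as follows (a random 784-vector stands in for a real row of Xtrain; with the real data you would index a row of Xtrain instead):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # render off-screen; drop this line to get a window
import matplotlib.pyplot as plt

# Synthetic stand-in for one row of Xtrain (assumption: real rows come
# from mnistTVT.pickle and hold pixel values between 0 and 255).
digit = np.random.randint(0, 256, size=784).astype(float)

img = np.reshape(digit, (28, 28))             # row vector -> 28 x 28 pixel array
plt.imshow(img, cmap='Greys', interpolation='nearest')
plt.savefig('digit.png')                      # or plt.show() in an interactive session
```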
In answering the questions below, do not use any Python loops, except where explicitly
allowed. All code should be vectorized.
(a) (0 points) Create a small training set consisting of the first 500 training points.
Use this small training set in the rest of this question.
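Since Xtrain is a Numpy array, the first 500 points can be taken with a simple slice. A minimal sketch (zero-filled stand-ins replace the real Xtrain and Ttrain here):

```python
import numpy as np

# Stand-ins for the real training arrays loaded from mnistTVT.pickle.
Xtrain = np.zeros((50000, 784))
Ttrain = np.zeros(50000, dtype=int)

Xsmall = Xtrain[:500]    # first 500 training images
Tsmall = Ttrain[:500]    # their labels
print(Xsmall.shape)      # (500, 784)
```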
(b) Using MLPClassifier in sklearn, create a neural network classifier with 5 hidden
units, a logistic activation function and no L2 penalty (i.e., no regularization).
Fit the classifier to the small training set. Use stochastic gradient descent with
an initial learning rate of 0.1 and at most 10,000 epochs of training. Using the
score method, compute and print out the training and validation accuracies. The
training accuracy should be 1.0, and the validation accuracy should be close to
0.7. The very first line of your code should set the random seed to 7.
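One way to set this up is sketched below. The data here are random stand-ins so the sketch is self-contained, and max_iter is reduced so it runs quickly; with the real small training set you would use max_iter=10000 as specified, and the accuracies printed here are meaningless for random labels:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

np.random.seed(7)   # the assignment asks for seed 7 on the very first line

# Synthetic stand-ins for the small training and validation sets
# (assumption: the real Xsmall/Tsmall/Xval/Tval come from the pickle file).
Xsmall = np.random.rand(500, 784)
Tsmall = np.random.randint(0, 10, size=500)
Xval = np.random.rand(100, 784)
Tval = np.random.randint(0, 10, size=100)

clf = MLPClassifier(hidden_layer_sizes=(5,),   # 5 hidden units
                    activation='logistic',
                    alpha=0.0,                 # no L2 penalty
                    solver='sgd',              # stochastic gradient descent
                    learning_rate_init=0.1,
                    max_iter=50)               # use 10000 with the real data
clf.fit(Xsmall, Tsmall)
print('training accuracy:  ', clf.score(Xsmall, Tsmall))
print('validation accuracy:', clf.score(Xval, Tval))
```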
(c) Why is the training accuracy higher than the validation accuracy in part (b)?
You should have expected this to happen in this case. Why? Hint: consider the
size of the training set and the number of weights in the neural net.