We recommend editing and running your code in Google Colab, although you are welcome to use your local machine instead.
Problem 8.1 Autoencoders (5 pts)
We’ll start by implementing a simple self-supervised learning method: an autoencoder. The autoencoder is composed of an encoder and a decoder. The encoder often compresses the original data with a funnel-like architecture, i.e., it throws away redundant information by reducing the layer sizes gradually. The final output size of the encoder is a bottleneck that is much smaller than the size of the original data. The decoder will use this limited amount of information to reconstruct the original data. If the reconstruction is successful, the encoder has arguably captured a useful, concise representation of the original data.
Such representations could help with downstream tasks such as object recognition, semantic segmentation, etc. Here, to test the usefulness of the representation, we’ll train the encoders on the STL-10 dataset, which is designed to evaluate unsupervised learning algorithms. This dataset contains 100,000 unlabeled images, 5,000 labeled training images, and 8,000 labeled test images. To keep training time short, we’ll use 10,000 unlabeled images to learn representations.
We’ll then use the feature representation that we learned to train an object recognition model
(a simple linear classifier) on the 5,000 labeled training images. If the learned representations are useful, we should obtain a performance improvement over only using the small, labeled training set.
1. We will build a small convolutional autoencoder and train it on the STL-10 dataset. The conv layers in the autoencoder all have kernel size = 4×4, stride = 2, padding = 1.
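The conv hyperparameters above (kernel 4×4, stride 2, padding 1) halve the spatial resolution at every layer, and the matching transposed convolutions double it. A minimal sketch follows; the number of layers, channel widths, and activations are illustrative assumptions, not part of the specification.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch: each conv (kernel 4, stride 2, padding 1) halves spatial size."""
    def __init__(self, latent_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),   # 96 -> 48
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 48 -> 24
            nn.ReLU(inplace=True),
            nn.Conv2d(64, latent_channels, kernel_size=4, stride=2, padding=1),  # 24 -> 12
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Sketch: transposed convs with the same hyperparameters double spatial size."""
    def __init__(self, latent_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, kernel_size=4, stride=2, padding=1),  # 12 -> 24
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),               # 24 -> 48
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),                # 48 -> 96
            nn.Sigmoid(),  # reconstructed pixel values in [0, 1]
        )

    def forward(self, z):
        return self.net(z)
```

For a 96×96 STL-10 image, the formula `out = floor((in + 2*padding - kernel) / stride) + 1` gives exactly `in / 2` per conv layer, so three layers yield a 12×12 bottleneck here.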
2. With the trained autoencoder, we freeze the parameters of the encoder and train a linear classifier on the autoencoder representations, i.e., the output of the encoder. You will compare the accuracy of this linear classifier with two other linear classifiers: one trained jointly with the encoder, and one trained on top of a randomly initialized encoder. Confirm that unsupervised pretraining improves classification accuracy compared to the random baseline. Method I should achieve about 30% accuracy on the test set. Method II should achieve above 40% accuracy. Method III should perform the worst of the three.
Figure 1: Sample images from the STL-10 dataset.
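The three regimes above differ only in which parameters receive gradients. One way to configure them is sketched below; the helper name `build_probe`, the mode labels, and the flattened feature dimension are assumptions for illustration, and which label corresponds to Method I/II/III follows your notebook's convention.

```python
import torch
import torch.nn as nn

def build_probe(encoder, feat_dim, num_classes=10, mode="frozen_pretrained"):
    """Hypothetical helper: set up a linear classifier under one of three regimes."""
    classifier = nn.Linear(feat_dim, num_classes)
    if mode in ("frozen_pretrained", "frozen_random"):
        # Freeze the encoder; only the linear head is optimized.
        # (frozen_random uses a randomly initialized encoder as the baseline.)
        for p in encoder.parameters():
            p.requires_grad = False
        params = list(classifier.parameters())
    elif mode == "joint":
        # Train encoder and classifier end to end on the labels.
        params = list(encoder.parameters()) + list(classifier.parameters())
    else:
        raise ValueError(f"unknown mode: {mode}")
    optimizer = torch.optim.Adam(params, lr=1e-3)
    return classifier, optimizer
```

Note that freezing via `requires_grad = False` and simply omitting the encoder's parameters from the optimizer are both needed in general: the former also skips gradient computation through the encoder.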
List of functions/classes to implement:
1. class Encoder (1 pt)
2. class Decoder (1 pt)
3. def train_ae (1 pt)
4. def train_classifier, and set the supervised parameter for the three methods (1 pt)
5. Report results in the Report results section at the end of the notebook (1 pt)
Problem 8.2 Contrastive Multiview Coding (5 pts)
We covered contrastive learning (CL) in lecture 15. CL is an approach to self-supervised learning [1, 2, 3] that avoids the need to explicitly generate images. Here, we’ll implement a recent contrastive learning method, Contrastive Multiview Coding (CMC). We’ll learn a vector representation for images: in this representation, two artificially corrupted versions of the same image should have a large dot product, while two different images should have a small dot product. In CMC, these corruptions are views of an image that contain complementary information. For example, in this problem set, our views will be luminance (i.e. grayscale intensity) and chromaticity (i.e. color) in the Lab color space. A good representation should create similar vectors for these two views (i.e. vectors with a large dot product), and the vectors should therefore contain the information that is shared between the views. We’ll minimize the loss