Throughout this project we developed three different deep neural networks that were used to solve this challenge. Below we detail how we used each one.
- Semantic Segmentation Network
- Variational Auto Encoder
- Triplet loss clustering Network
Network 1 – Semantic Segmentation Network
This neural network takes an image from the camera and determines for each pixel in the image whether or not it is likely to be a brick.
The base neural network that we used was Model A from the paper Wider or Deeper by Wu et al. This is a very wide ResNet with 38 hidden layers and roughly 124 million parameters. We used an open source, pre-trained, TensorFlow implementation of this network that has been adapted specifically for image segmentation in video. This is a very modern network architecture (2017), and currently has some of the best results for image classification (ImageNet) and image segmentation (PASCAL). The implementation we used was developed by Paul Voigtlaender for this paper and the code can be found online.
The original network was pre-trained on ImageNet, Microsoft COCO and PASCAL by Wu et al. Then Voigtlaender replaced the output layer with a two-class softmax and trained the network for “objectness” on the PASCAL VOC 2012 dataset to be able to tell foreground “objects” from background. These were the pretrained weights that we used for our network.
Once we had this pre-trained segmentation network set up, we needed to train it to identify LEGO bricks. Generating training data for such a network is quite different, as it requires hand-segmented pixel masks that mark which pixels belong to bricks and which do not. We trained our network using an iterative bootstrapping approach. We started by generating by hand a small training set of 29 images, which contained 15 different brick types. Each image contained only one brick, on the carpet on which we were running our robot at the time. The images also contained random non-brick objects such as bananas, cups, chairs, table legs, and people’s hands, so that the network could learn to differentiate bricks from other objects that it might see when running.
Even with such a small training set, the results of the network were spectacular. It could very accurately segment bricks that it had never seen before. The network still sometimes gave false positives, for example when an object that it had never seen before (say, someone’s shoe) came into view. It also had a lot of trouble with lighting changes and background changes. However, in general, with so little training data, we could get segmentations accurate enough to localize the bricks and enable the robot to drive towards them and pick them up.
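As a rough illustration of how a per-pixel segmentation can be turned into a brick location, here is a minimal sketch. It assumes the network output is available as a 2-D array of brick probabilities; the function name `localize_brick` and the 0.5 threshold are hypothetical and not taken from the robot's actual code.

```python
import numpy as np

def localize_brick(prob_map, threshold=0.5):
    """Return the centroid and bounding box of pixels above threshold, or None.

    prob_map: 2-D array of per-pixel brick probabilities (assumed format).
    threshold: hypothetical cut-off for calling a pixel "brick".
    """
    mask = prob_map > threshold
    if not mask.any():
        return None  # no brick pixels found in this image
    rows, cols = np.nonzero(mask)
    centroid = (float(rows.mean()), float(cols.mean()))
    bbox = (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))
    return centroid, bbox

# Toy 6x8 probability map with one "brick" covering rows 2-3 and columns 4-6.
prob_map = np.zeros((6, 8))
prob_map[2:4, 4:7] = 0.9
centroid, bbox = localize_brick(prob_map)
print(centroid, bbox)
```

The centroid is in image coordinates; converting it into a driving direction depends on the camera geometry, which is not covered here.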
The next step in improving the segmentation output of this network was to generate more training data. We tried to generate as much training data in as automatic a way as possible. To do this, we used Voigtlaender’s code for online adaptive one-shot image segmentation. This means that we took training videos of one or more bricks in the workspace with many other random objects present. We then hand-segmented only the very first frame of each of these videos. We could then use the network in one-shot segmentation mode to predict very accurate segmentations for the rest of the frames in the same video. Finally, we went through the results by hand, removed all the segmentations that were not accurate enough, and added the good ones to our training set.
The online adaptive one-shot segmentation works as follows. First, the network is trained for 20 epochs on the hand-segmented first frame. Then the network predicts a segmentation for the next frame. From this segmentation, pixels that it predicts to belong to the same brick with a probability greater than 97% are labeled as positive examples, pixels that are more than a certain distance away from the brick are labeled as negative examples, and the remaining pixels are treated as neutral and ignored. The network then alternates between training on this new sample and on the original hand-segmented frame to prevent drift. This is repeated for every frame of the video, and very accurate segmentations result.
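The labeling step described above can be sketched as follows. This is a simplified illustration, not Voigtlaender's code: it assumes the prediction is a 2-D array of foreground probabilities, measures distance from the confident brick pixels with a brute-force search, and uses a hypothetical `neg_distance` of 3 pixels. The label values are also arbitrary choices.

```python
import numpy as np

POSITIVE, NEGATIVE, IGNORE = 1, 0, 255  # arbitrary label values for this sketch

def adaptation_labels(prob_map, pos_threshold=0.97, neg_distance=3.0):
    """Build per-pixel training labels from a predicted probability map.

    Confident pixels become positives, pixels far from every positive become
    negatives, and everything in between is ignored during training.
    """
    positive = prob_map > pos_threshold
    labels = np.full(prob_map.shape, IGNORE, dtype=np.uint8)
    if not positive.any():
        return labels  # nothing confident enough: ignore the whole frame
    # Distance from every pixel to the nearest positive pixel (brute force,
    # fine for small images; a distance transform would be used at scale).
    grid = np.moveaxis(np.indices(prob_map.shape), 0, -1).reshape(-1, 2)
    pos_coords = np.argwhere(positive)
    diffs = grid[:, None, :] - pos_coords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    dist = dist.reshape(prob_map.shape)
    labels[dist > neg_distance] = NEGATIVE
    labels[positive] = POSITIVE
    return labels

# Toy frame: one confident brick pixel and one uncertain neighbour.
prob_map = np.zeros((5, 9))
prob_map[2, 1] = 0.99
prob_map[2, 2] = 0.80
labels = adaptation_labels(prob_map, neg_distance=3.0)
print(labels)
```

The ignored band around the brick keeps uncertain boundary pixels from being trained on with a possibly wrong label.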
Using this method we trained on fourteen videos with around 100 frames each, with each frame separated by around half a second, for a total dataset size of around 1400 segmented images. We ensured that we took video samples at all times of the day, and tried to put as many non-brick items as possible into the videos so that the network could accurately differentiate between bricks and non-bricks.
For the next (and so far final) round of training we took a further 60 videos of bricks, each with around 50 frames, again spaced around half a second apart. We then had our network segment all of these images. Instead of segmenting the first frame by hand as in the previous training iteration, for each video we chose the frame for which the network made the best segmentation and used this one frame as the ground truth for the one-shot video segmentation, which improved the segmentation of the remaining frames. After doing this we had a training set with around 75 different bricks (including combination bricks) and around 4500 accurate segmentations to train the network on.
Network 2 – Variational Auto Encoder (VAE)
A VAE is a deep generative model that uses Bayesian inference to learn the probability distribution that is likely to have produced the observations (bricks), such that one can generate unseen samples from it. This latent probability distribution should capture the essence of “brickness”, which is exactly what we want the robot to have. In our case, we would expect the VAE to pick up different features of the bricks such as color, shape and angle.
While there are several tutorials on VAEs online, we decided to use the implementation from the Edward library. Like probably all of the examples one can easily find online, this one uses the MNIST data set, so the model needed to be modified to work on images of LEGO bricks instead. To do this, we changed the generative and inference networks to accommodate a different input size and used a Gaussian distribution rather than a Bernoulli to encode a continuous latent space.
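To make the idea of a Gaussian latent space concrete, the sketch below shows the two standard pieces of a Gaussian VAE latent layer: sampling with the reparameterization trick, and the closed-form KL term against a standard normal prior. It is written in plain NumPy rather than Edward, and the batch size of 4 and latent size of 8 are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
# Stand-in for the inference network's output on a batch of 4 images:
# a mean and a log-variance per latent dimension (8 here, chosen arbitrarily).
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))

z = sample_latent(mu, log_var, rng)      # one latent sample per image
kl = kl_to_standard_normal(mu, log_var)  # zero when the posterior equals the prior
print(z.shape, kl)
```

In a full model the KL term is added to the reconstruction loss from the generative network to form the training objective.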
To train our VAE we used a mixture of synthetic images and actual photos, all resized to the same format, normalized and equalized.
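One possible version of this preprocessing is sketched below. The details are assumptions: it treats the images as 8-bit grayscale, resizes with nearest-neighbour sampling to a hypothetical 32x32, interprets "equalized" as histogram equalization, and normalizes to the range [0, 1].

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def equalize_histogram(img):
    """Histogram-equalize an 8-bit grayscale image via its cumulative histogram."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = max(cdf[-1] - cdf_min, 1)  # avoid division by zero on flat images
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255)
    return lut.astype(np.uint8)[img]

def preprocess(img, size=(32, 32)):
    """Resize, equalize, and scale to float32 values in [0, 1]."""
    resized = resize_nearest(img, *size)
    equalized = equalize_histogram(resized)
    return equalized.astype(np.float32) / 255.0

# Low-contrast toy image standing in for a photo or a synthetic render.
rng = np.random.default_rng(0)
img = rng.integers(50, 100, size=(48, 64), dtype=np.uint8)
out = preprocess(img)
print(out.shape, out.min(), out.max())
```

Applying the same steps to both synthetic and real images is what brings the two sources into a comparable format.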
Network 3 – Triplet Loss Brick Clustering
The second method that we used for unsupervised brick clustering works on a slightly different premise. The VAE learned a latent space that allowed it to generate realistic images of bricks, but nothing specifically forced the latent variables of different images of the same brick to be similar. The second approach takes as one data sample not an individual image of a brick but a video of a single brick, filmed as the camera moves around it at different angles. The training samples are still completely unlabeled, so this is still unsupervised learning of brick representations, but we can use the meta-information that every frame in a video contains the same brick to train a neural network to map bricks into a discriminative latent space.
This network is a 38-layer wide ResNet, pre-trained on ImageNet for image classification, with the last softmax layer removed and replaced with a 32-variable output layer that is trained using a triplet loss. Each training batch consists of 8 random images from each of 16 different videos of 16 different bricks. A loss is then calculated in “Batch-Hard” mode. This means that for each of the 128 images in the batch, one contribution to the loss is calculated by finding the most dissimilar image in the batch from the same video and the most similar image in the batch from any of the other videos. The weights are then adjusted so that the distance between images of the same brick is minimized while the distance to images of different bricks is maximized.
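The batch-hard selection described above can be sketched in NumPy as follows. This computes only the loss value for one batch of embeddings, with no network or gradient step. The hinge form and the `margin` value are assumptions, since the text does not say which variant of the triplet loss was used.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, video_ids, margin=0.2):
    """Mean batch-hard triplet loss over a batch.

    For each anchor image:
      hardest positive = farthest embedding from the same video,
      hardest negative = closest embedding from any other video.
    The hinge margin is an assumption made for this sketch.
    """
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))         # pairwise L2 distances
    same = video_ids[:, None] == video_ids[None, :]   # same-video mask
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    per_anchor = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return float(per_anchor.mean())

# Batch layout from the text: 16 videos x 8 images = 128 images, 32 outputs each.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((128, 32))  # stand-in for network outputs
video_ids = np.repeat(np.arange(16), 8)      # which video each image came from
loss = batch_hard_triplet_loss(embeddings, video_ids)
print(loss)
```

Because only the hardest positive and hardest negative are used per anchor, each of the 128 images contributes exactly one term to the loss.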
This was then trained on a dataset of 60 different videos of bricks, each containing approximately 50 images. The result is a network that transforms images into a 32-dimensional latent space where images of the same brick are clustered very closely, even at different viewing angles and lighting conditions, whereas the representations of different bricks are further apart. Thus we can compare the similarity of, say, a given brick to a reference brick by just calculating their L2 distance in the latent space, and we can cluster real-life bricks into groups by clustering in this latent space.
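Comparing a brick against reference bricks then reduces to a nearest-neighbour lookup under L2 distance. The sketch below uses made-up 32-dimensional vectors in place of real network outputs; the reference names and the helper `nearest_reference` are hypothetical.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def nearest_reference(embedding, references):
    """Return (name, distance) of the closest reference embedding."""
    names = list(references)
    dists = [l2_distance(embedding, references[name]) for name in names]
    best = int(np.argmin(dists))
    return names[best], dists[best]

# Made-up embeddings standing in for the network's 32-dimensional outputs.
references = {
    "brick_a": np.zeros(32),
    "brick_b": np.ones(32),
}
query = np.full(32, 0.1)
name, distance = nearest_reference(query, references)
print(name, distance)
```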
Putting all the networks together for Achievement 7:
Achievement 7 was to find a way to perform unsupervised clustering of bricks. To do this, we placed a series of random bricks around the robot and had the robot spin around so that it saw all of the bricks, taking many images as it rotated. It then used the segmentation network on these images to extract the bricks, and fed each extracted brick image through the triplet loss network to obtain a 32-dimensional latent space representation of each brick. We clustered these representations using a number of different clustering algorithms, and found that the best was a Gaussian Mixture Model clustering into N groups.
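The final clustering step can be illustrated with a small Gaussian mixture fitted by expectation-maximization. This is a simplified stand-in written for this explanation, not the implementation used on the robot: it assumes diagonal covariances and a farthest-point initialization, and the embeddings are synthetic. In practice a library implementation would likely be used instead.

```python
import numpy as np

def farthest_point_init(x, k):
    """Pick k initial means: the first point, then repeatedly the farthest point."""
    means = [x[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((x - m) ** 2, axis=1) for m in means], axis=0)
        means.append(x[np.argmax(d)])
    return np.array(means)

def gmm_cluster(x, n_components, n_iter=50, eps=1e-6):
    """Fit a diagonal-covariance Gaussian mixture with EM; return hard labels."""
    means = farthest_point_init(x, n_components)
    variances = np.tile(x.var(axis=0) + eps, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point (log-space).
        diff = x[:, None, :] - means[None, :, :]
        log_prob = -0.5 * (diff ** 2 / variances
                           + np.log(2 * np.pi * variances)).sum(axis=-1)
        log_resp = np.log(weights) + log_prob
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        diff = x[:, None, :] - means[None, :, :]
        variances = (resp[:, :, None] * diff ** 2).sum(axis=0) / nk[:, None] + eps
    return resp.argmax(axis=1)

# Synthetic 32-dimensional embeddings: 10 views each of two different "bricks".
rng = np.random.default_rng(0)
brick_a = rng.normal(0.0, 0.1, size=(10, 32))
brick_b = rng.normal(5.0, 0.1, size=(10, 32))
embeddings = np.vstack([brick_a, brick_b])
labels = gmm_cluster(embeddings, n_components=2)
print(labels)
```

Here the number of groups N is passed in as `n_components`, matching the text, where N is chosen in advance.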
Here you can see some images of bricks saved by the robot on the go:
Results of this process are below: