Last Monday was judgment day for our dear ROB: he needed to cluster bricks by color and put them into the correct box, all while evading obstacles. Furthermore, people tried to confuse him with printouts of Lego bricks on carpet (did not fool ROB), leaves (did fool ROB), and ensembles of Lego bricks (did not fool ROB). In the end the jury awarded ROB 26 out of 30 points, the highest score of the 5 robots, and we could not be more proud of him. Below is a small summary of what made ROB do robot things.
ROB needed to do 3 things: carry the Jetson, grab and lift bricks, and detect bricks. In the image below we showcase how ROB does these things.
In the left image you can see the Jetpack, a firm Lego frame in which we can place the Jetson. The top-right image shows the two cameras. The bottom one spots bricks and obstacles, while the top camera is used purely for checking which brick is between the grabbers. Also note that the camera's distance to the ground is always the same, which enables us to estimate the size of a brick. The last and most awesome part of the robot is the gripper, which is capable of picking up (almost) any brick, and thanks to its elevator mechanism we can access boxes that are about 10 centimeters high.
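Since the camera height is fixed, the pixel-to-centimeter scale on the ground plane is constant, so brick size follows directly from its size in pixels via the pinhole model. A tiny illustrative sketch (not ROB's actual code; the focal length and height are made-up example values):

```python
# Illustrative sketch: with a camera at a fixed height above the ground,
# real size = pixel size * distance / focal length (pinhole camera model).
# The numbers below are made-up example values, not ROB's calibration.

def brick_size_cm(pixel_width: float, focal_length_px: float, camera_height_cm: float) -> float:
    """Estimate the real-world width of a brick lying on the ground."""
    return pixel_width * camera_height_cm / focal_length_px

# A brick spanning 60 px, seen from 20 cm up with a 400 px focal length:
print(brick_size_cm(60, 400, 20))  # -> 3.0 (cm)
```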
Since Lego detection was working, we wanted to tackle generic object detection as well. Below we go over the final version of our vision system: a single network that handles both the Lego localization and the generic object localization, and runs on the Jetson at 4-5 FPS.
As with our previous vision network, we train this network with weakly supervised learning for the position. Practically speaking, this means the network has to learn the position of the Lego brick without us correcting it (so we don't have to manually annotate the data). You can read more about it here.
So what is new? We are using pretrained weights, since they allow us to do generic object detection and make the Lego heatmap far more robust. The robustness comes mostly from the pretrained weights having seen a much wider range of objects and lighting conditions than our 2000 images cover.
We use the first 4 blocks of (frozen) VGG16 weights, which we found to be the best compromise between running speed and accuracy. The first VGG16 block activates on almost everything, the second block activates on everything with sharp lines, while the filters in the last two blocks practically only activate on objects. We tested all the different layers in these blocks, and the third convolutional layer of block 3 gave the nicest heatmap for objects. By subtracting the Lego heatmap we can distinguish the Lego bricks from the generic objects.
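The subtraction step itself is simple. A toy numpy sketch (the 4x4 heatmaps and the 0.5 threshold are made-up example values, not our real activations):

```python
import numpy as np

# Toy example of the subtraction step: given a generic-object heatmap (from a
# mid-level VGG16 activation) and the Lego heatmap, removing the Lego response
# leaves only the non-Lego objects. Values below are made up for illustration.

object_heat = np.array([
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0],
])  # hot on both the Lego brick (column 1) and an obstacle (column 3)

lego_heat = np.array([
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])  # hot only on the Lego brick

# Anything still hot after subtracting the Lego response is a generic obstacle.
generic_only = np.clip(object_heat - lego_heat, 0.0, None)
obstacle_mask = generic_only > 0.5
print(np.argwhere(obstacle_mask))  # only the column-3 blob survives
```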
To summarize: by using the VGG16 weights our Lego heatmap drastically improves, and we get generic object detection for free while still having an acceptable FPS.
For clustering we needed a way to distinguish bricks by color (and shape), preferably in a completely unsupervised way. This sounds much easier than it actually is: we dabbled with statistical tests, kNN, k-means, t-SNE and variational autoencoders, tested them in various color spaces, and tried multiple layers of preprocessing. None of these methods worked very well, since the camera did not produce the same colors when the lighting changed; for example, yellow became reddish in the dark and white in bright light. This convinced us to use a siamese network to create an embedding.
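To see why clustering on raw pixel values is so brittle, here is a toy example with made-up RGB values: under dim light a yellow brick can sit closer in RGB space to a red brick than to the same yellow brick under bright light.

```python
import numpy as np

# Made-up RGB values, for illustration only: lighting shifts a color across
# RGB space further than the gap between two genuinely different colors.

yellow_bright = np.array([250.0, 240.0, 160.0])
yellow_dim    = np.array([140.0,  70.0,  30.0])  # same brick, shifted towards red
red_normal    = np.array([160.0,  40.0,  40.0])

d_same_color = np.linalg.norm(yellow_bright - yellow_dim)
d_diff_color = np.linalg.norm(yellow_dim - red_normal)
print(d_same_color > d_diff_color)  # True: RGB distance groups the wrong bricks
```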
A simplified siamese network is shown in the image above. We trained it by feeding the same convolutional 'tower' two images, which had either the same or a different color. The tower outputs a vector of length 3 per image. By forcing the siamese network to recognize whether the two images were of a similar color or not, the tower became an 'embedding' that knows the color regardless of the lighting, pose, or shape of the brick. This embedding is shown on the right; as you can see, the colors are grouped up nicely (except for a few blue dots, where the original image shows an almost fully white brick). Using this embedding we can cluster the colors of the bricks in a satisfactory way.
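The training objective behind this can be sketched with a contrastive loss: pairs of the same color are pulled together in the embedding, pairs of different colors are pushed at least a margin apart. A hand-rolled numpy version with made-up embeddings (not our actual training code):

```python
import numpy as np

# Contrastive loss sketch for a siamese network: the shared tower maps each
# image to a 3-vector; this loss shapes the embedding space. The vectors and
# the margin below are made-up example values.

def contrastive_loss(emb_a, emb_b, same_color: bool, margin: float = 1.0) -> float:
    d = np.linalg.norm(emb_a - emb_b)
    if same_color:
        return d ** 2                     # pull same-color pairs together
    return max(0.0, margin - d) ** 2      # push different colors >= margin apart

a = np.array([0.1, 0.2, 0.9])  # e.g. a blue brick in bright light
b = np.array([0.1, 0.3, 0.8])  # the same blue brick in dim light
c = np.array([0.9, 0.1, 0.1])  # a red brick

print(contrastive_loss(a, b, same_color=True))   # small: the pair is already close
print(contrastive_loss(a, c, same_color=False))  # 0.0: the pair is already > margin apart
```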
The past few weeks were a blast. We learned a ton, and we would like to thank our mentor Philip (for his unwavering faith in us ;)) and VW for providing the opportunity to work on an awesome project. The details/code of our project can be found on GitHub.