The video above shows the results of training a Variational Auto-Encoder on LEGO blocks. The resulting clustering is not perfect, but is much better than other results we got previously. In previous attempts the main problem was that clusters focused more on shape than color.
The Variational Auto-Encoder has 3 latent variables, is trained for only 200 gradient descent steps, each computed for the full dataset of less than 200 samples. Additionally, the Auto-Encoder is not trained to reconstruct the pixels of the input that are not part of a LEGO block. This is done by masking the loss before computing the sum over batch and pixels. Thanks to this short training and relaxation of the loss, the VAE does not learn to reconstruct detailed features like shape, but only a colored blur. You can see an example below: