A Discussion on Deep Learning Based Control Strategies

Purple-Team

Background

We have previously seen that, using the segmented mask, the robot performs fairly well at identifying and picking up the lego block while avoiding obstacles in its path (by dodging them or placing them away). However, since the control strategy was manually engineered, it would be a monumental task to account for all the plausible scenarios that the robot and its environment may encounter. For conditions that were not considered while engineering the control logic (e.g. moving shadows or large illumination changes), the robot is therefore expected to fail at executing the desired tasks.

We would therefore like to consider deep learning based control strategies, in which the robot implicitly extracts the relevant features of the environment using a trained neural network, allowing it to take the correct decisions. However, since deep learning is a data-driven paradigm, samples need to be collected to train the neural network.

Learning by imitation

In this approach, we would like the robot to mimic the behaviour of a human teacher whenever it encounters conditions similar to those presented in the demonstrations. The video below shows a subset of the 30 training sessions that were collected by manually maneuvering the robot towards the lego block. The action (left, right or straight) taken by the human for each frame serves as the "correct" label. A CNN classifier was then trained to decide which of the three actions to take, with the segmented mask as the input.
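A minimal sketch of such a classifier is given below, assuming PyTorch, a single-channel 64x64 segmented mask as input, and illustrative layer sizes; the architecture actually trained may differ.

```python
# A sketch of the 3-action CNN classifier: segmented mask in, action logits out.
# Input resolution and layer sizes are assumptions, not the exact architecture used.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, num_actions=3):  # left, straight, right
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, mask):
        return self.classifier(self.features(mask))

# Example: a batch of 4 segmented masks -> logits over {left, straight, right}
logits = ActionClassifier()(torch.zeros(4, 1, 64, 64))
```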

Since 30 training sessions were not enough, additional segmented frames were artificially generated with the correct labels and included in the training set. Some examples are shown below.
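One simple way to generate such frames, sketched here under the assumption that the masks are NumPy arrays and the labels are encoded as 0 = left, 1 = straight, 2 = right, is to mirror a frame horizontally and swap the left/right label; the augmentation actually used may have been different.

```python
# A sketch of one possible augmentation: horizontally mirroring a segmented mask
# and swapping the left/right labels accordingly (straight stays unchanged).
import numpy as np

def augment_flip(mask, label):
    flipped = np.fliplr(mask).copy()
    swapped = {0: 2, 1: 1, 2: 0}[label]  # left <-> right
    return flipped, swapped

# Example usage on a dummy 64x64 mask labelled "left"
aug_mask, aug_label = augment_flip(np.zeros((64, 64), dtype=np.uint8), 0)
```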

The video below demonstrates that the robot is able to follow the lego block correctly, albeit rather slowly in some instances. The colour in the action video shows the action being taken (white for left, gray for straight and black for right).

However, with obstacles also in the frame, the robot has a hard time deciding which trajectory to take. Towards the end of the video below, the robot oscillates for a long time between the lego and the obstacle, only slightly moving forward in the direction of the lego.

Reinforcement Learning

To improve the imitation model we need many more training sessions. However, collecting and recording supervised sessions is a tedious task. Moreover, by learning through imitation, the model can at best perform as well as the human teacher; if the teacher makes mistakes during the demonstration, the model may learn to replicate them at test time. We would therefore like the robot to learn without explicit human supervision.

One approach is to formulate the model and the environment as a reinforcement learning problem, in which the model receives rewards for executing correct actions and is penalized for incorrect decisions. Left alone for a large number of episodes under this formulation, the robot should be able to learn a good control policy without supervision.
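The text above does not fix a particular algorithm; as one plausible choice, the sketch below shows a DQN-style temporal-difference update on a single transition, reusing the ActionClassifier network from the imitation sketch as the Q-network. The names q_net, optimizer and gamma, as well as the hyperparameter values, are illustrative.

```python
# A sketch of a DQN-style update: one temporal-difference step on a single
# (state, action, reward, next_state) transition.
import torch
import torch.nn.functional as F

gamma = 0.99  # discount factor (illustrative value)

def td_update(q_net, optimizer, state, action, reward, next_state, done):
    q_value = q_net(state)[0, action]                     # predicted Q(s, a)
    with torch.no_grad():
        bootstrap = 0.0 if done else gamma * q_net(next_state).max().item()
    target = torch.tensor(reward + bootstrap)             # TD target r + gamma * max_a' Q(s', a')
    loss = F.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage on a dummy transition (states are 1x1x64x64 segmented masks;
# ActionClassifier as defined in the imitation sketch above)
q_net = ActionClassifier()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
s, s_next = torch.zeros(1, 1, 64, 64), torch.zeros(1, 1, 64, 64)
td_update(q_net, optimizer, s, action=1, reward=-2.0, next_state=s_next, done=False)
```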

The model can be initialized with the weights it learnt through imitation. However, we still need a great deal of episodes, which may take days or even weeks of training on the real robot. Therefore, a virtual simulator was designed that resembles the environment as closely as possible. The reinforcement learning policy can thus be trained on this simulator and then transferred to the actual environment, which should reduce the number of training episodes needed after the transfer. An example of the simulator is shown below.
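Warm-starting the policy from the imitation model could look as follows, assuming both networks share the ActionClassifier architecture sketched earlier and that the imitation weights were saved to a hypothetical checkpoint file imitation_classifier.pt.

```python
# A sketch of initializing the RL policy with the imitation-learned weights.
# "imitation_classifier.pt" is a hypothetical checkpoint path.
import torch

q_net = ActionClassifier(num_actions=3)
q_net.load_state_dict(torch.load("imitation_classifier.pt"))

# Train q_net in the simulator first, then fine-tune on the real robot,
# ideally for far fewer episodes than training from scratch would require.
```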

A reward of +1 is given if the number of lego pixels increases, and -2 if it remains the same; this pushes the agent to reach its goal as fast as possible. The episode ends if either the obstacle or the lego brick enters the bottom centre of the screen, in which case the agent receives a high negative or a high positive reward respectively. Due to lack of time, this solution was not thoroughly tested and therefore could not be implemented on the robot.
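For completeness, the reward scheme described above could be sketched as follows, assuming the simulator reports the lego-pixel count of each frame and the two terminal conditions; the +100/-100 terminal values are illustrative stand-ins for "high positive" and "high negative" rewards.

```python
# A sketch of the reward scheme: +1 for progress towards the brick, -2 otherwise,
# with large terminal rewards when the brick or the obstacle reaches the
# bottom centre of the frame.
def compute_reward(prev_lego_pixels, lego_pixels,
                   lego_at_bottom_centre, obstacle_at_bottom_centre):
    """Returns (reward, episode_done) for a single simulator step."""
    if lego_at_bottom_centre:
        return +100.0, True   # reached the brick: high positive reward, episode ends
    if obstacle_at_bottom_centre:
        return -100.0, True   # ran into the obstacle: high negative reward, episode ends
    if lego_pixels > prev_lego_pixels:
        return +1.0, False    # more lego pixels: the agent is getting closer
    return -2.0, False        # no progress: penalise to push the agent towards the goal
```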