Nearly done! Achievements 5 and 6

Gripping and localizing an unknown but specific brick



Achievements 5 and 6 were quite similar to achievement 4, which we had already solved gracefully. The only change is that the robot has to search for a brick it has not been trained on, which can be chosen by the jury, and pick it up. So the only thing we had to add was a phase in which we tell the robot which brick it should actually look for.

The whole task can be divided into four phases:

  1. Telling the robot which brick to pick up
  2. Exploring space looking for the brick
  3. Driving towards the brick and picking it up
  4. Putting the brick in the box

Additionally, we are working on refining our previous solutions and slowly getting the robot to its final state. After all, there are only two days of work left. The minor (or rather major, as it turns out) tasks are:

  1. Getting our code to run on the Jetson
  2. Teaching the robot to find boxes and distinguish between boxes with markers
  3. Adding a second camera
  4. Speeding up the segmentation network
  5. Optimizing exploration

Solving task 5

There are different options for telling the robot which brick to pick up. One is to let the robot scan the environment by turning 360 degrees and taking images of the bricks, then let the network segment all bricks it has seen and display them on the screen. A juror can then decide which brick she wants the robot to pick up and click on it on the screen. Our latent network then computes the latents for this brick; the robot turns back to the angle at which it saw the brick and, by comparing the distance between the chosen brick's latents and those of the bricks it currently sees in the camera, drives towards the brick and picks it up.

The second approach, the one we actually used, is a bit different. Before the robot starts scanning the environment, we let it create a representation of the brick by showing the brick to it from different angles. Then a random sample of images is taken and segmented. For the images where we actually get a segmentation, i.e. where there is a brick in the image, we calculate the latent variables, which are basically a condensed digital representation of the brick consisting of 32 float numbers. The whole process takes approximately two minutes. Only then does the robot start scanning the environment to look for the brick it has previously seen, and the same code as in achievements 3 and 4 comes into play.
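As a sketch of the matching step (the 32-dimensional latent size comes from the text; the function names and the distance threshold are illustrative assumptions, not our actual code), one can average the latents collected during scanning and compare each candidate brick seen later by Euclidean distance:

```python
import numpy as np

def build_target_latent(scan_latents):
    # scan_latents: list of 32-dim latent vectors gathered while
    # showing the brick from different angles
    return np.mean(np.asarray(scan_latents), axis=0)

def best_match(target, candidate_latents, max_distance=1.0):
    # Return the index of the candidate closest to the target latent,
    # or None if nothing is within max_distance (threshold is illustrative).
    candidates = np.asarray(candidate_latents)
    distances = np.linalg.norm(candidates - target, axis=1)
    best = int(np.argmin(distances))
    return best if distances[best] <= max_distance else None
```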

In theory, achievement 5 is therefore just an extension of task 4, with the brick scan at the beginning. In practice, though, we changed quite a lot, which is more interesting than the scanning part.

Running code on the Jetson

This week we finally made the big move from our desktop workstation to the Jetson as the true brain of our robot. Until now we had confined ourselves to running all our code on the PC, as our image segmentation network requires quite a bit of RAM and computational power. While we are working hard at shrinking the network so we can run it at a reasonable FPS on the Jetson, some time ago we rewrote our network code to support an unlimited number of slave devices that compute things as directed by the master device (now the Jetson). So, for example, if we want to segment an image, we send a compressed version of it over the network to our network_segmentation_slave, which runs either on the second Jetson, on our workstation, or on the downstairs DGX cluster.

Similarly to the messaging protocol we introduced earlier to communicate with our EV3, we encode all our messages (Python dicts of data and arguments) as JSON. This requires a few tricks and tweaks to make things run smoothly. For example, you want to serialize your numpy.ndarrays as strings without losing information such as dtype or shape. NumPy offers some functions for this, but not all of them are applicable, and some have downsides. Choose carefully! In case of doubt, it is easy to do the serialization yourself. Below are two small code snippets that serialize ndarrays with all the information you need to restore them correctly, independent of the NumPy version.

import base64
import json

import numpy as np

def encode_numpy_array(arr):
    # Store dtype, raw bytes (base64) and shape so the array can be
    # restored exactly.
    return json.dumps([str(arr.dtype),
                       base64.b64encode(arr.tobytes()).decode('utf-8'),
                       arr.shape])

def decode_numpy_array(arr_string):
    # Decode JSON
    enc = json.loads(arr_string)

    # Convert string to numpy datatype
    data_type = np.dtype(enc[0])

    # Decode the base64-encoded bytes (gives a 1d array)
    data_array = np.frombuffer(base64.b64decode(enc[1].encode('utf-8')), data_type)

    # Check if a shape was given (always the case with the function above)
    if len(enc) == 3:
        # reshape and return result
        return data_array.reshape(enc[2])

    return data_array

# Round trip:
# restored = decode_numpy_array(encode_numpy_array(original))

One painful thing we also encountered often was making the IP of our master (where the MQTT broker service runs) known to the clients.

For this we wrote some simple routines which perform a UDP broadcast on a specified network device. This is important once we have many different connected slaves and it becomes too cumbersome to write down each IP address by hand. It is not the full code, but the snippet below should give you a general idea of how to roll such a simple protocol yourself (note, though, that there are some great libraries for this as well!)

import json
import select
from socket import socket, AF_INET, SOCK_DGRAM, SOL_SOCKET, SO_BROADCAST, SO_REUSEADDR

import netifaces

# `config` (ports and the broadcast magic string) and `logger` are
# provided elsewhere in our project code.

def find_interface_ip(interface):
    # finds the IP address of the specified network device/interface
    return netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr']

def master_announcement(expected_slaves, interface, message_delay=5, repeats=12):
    # broadcasts the IP address of our master on the network of the
    # specified network interface

    # Detect IP of the specified network device
    server_ip = find_interface_ip(interface)

    # UDP socket for the broadcasting
    s = socket(AF_INET, SOCK_DGRAM)  # create UDP socket for broadcast
    s.bind((server_ip, 0))
    s.setsockopt(SOL_SOCKET, SO_BROADCAST, 1)  # this is a broadcast socket

    # UDP socket to listen to clients acknowledging the master
    r = socket(AF_INET, SOCK_DGRAM)  # create UDP socket
    r.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    r.bind((server_ip, config.SERVICE_ACK_PORT))

    # Slaves who call back are added to this set and compared against the
    # expected list of slaves
    ack_slaves = set()

    for repeat in range(repeats):
        data = config.SERVICE_BROADCAST_MAGIC + server_ip + ":" + json.dumps(sorted(ack_slaves))
        s.sendto(data.encode('utf-8'), ('<broadcast>', config.SERVICE_BROADCAST_PORT))
        logger.debug("Broadcasting own IP+slaves: %s (try=%d)", data, repeat)

        # Check here, to ensure that we send the complete list of slaves
        # at least once
        if ack_slaves == set(expected_slaves):
            logger.debug("Found all expected slaves")
            return True
        logger.debug("Not all slaves present yet: %s (expected: %s)",
                     ack_slaves, expected_slaves)

        # Wait for messages and time out eventually
        ready = select.select([r], [], [], message_delay)
        if ready[0]:
            data = r.recv(4096)
            if data.startswith(config.SERVICE_BROADCAST_MAGIC.encode('utf-8')):
                slave = data[len(config.SERVICE_BROADCAST_MAGIC):].decode('utf-8')
                if slave not in ack_slaves:
                    ack_slaves.add(slave)  # add potential new slave to set
                    logger.debug("Found new slave: %s", slave)
        else:
            logger.debug("Timed out waiting for slave acknowledgements")

    logger.debug("Not all slaves registered")
    return False  # not all slaves connected in time

def find_master(client_id):
    # client side: detect the master's broadcast, then send our own name
    # back to the master. Returns the master's IP address.

    s = socket(AF_INET, SOCK_DGRAM)  # create UDP socket
    s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    s.bind(('', config.SERVICE_BROADCAST_PORT))

    logger.debug("Waiting for broadcast from master...")
    while True:
        data, addr = s.recvfrom(4096)  # wait for a packet with the correct magic
        if data.startswith(config.SERVICE_BROADCAST_MAGIC.encode('utf-8')):
            master_ip, ack_slaves = data[len(config.SERVICE_BROADCAST_MAGIC):].decode('utf-8').split(":", 1)
            logger.debug("Found master at IP: %s", master_ip)

            if client_id in json.loads(ack_slaves):
                logger.debug("%s: already registered", client_id)
                return master_ip

            logger.debug("Sending acknowledgement")
            ack_msg = config.SERVICE_BROADCAST_MAGIC + client_id
            s.sendto(ack_msg.encode('utf-8'), (master_ip, config.SERVICE_ACK_PORT))

Learning to distinguish boxes

Though for this task we only have one box to put bricks in, we also implemented recognition of more than one marker. Key to this is the MarkerTracker, which, when updated with an image, returns a list of marker IDs.

class MarkerTracker:
    def __init__(self):
        self.coords = []        # Coordinates of the markers
        self.marker_ids = []    # IDs of the markers

The main feature of the MarkerTracker is that it holds information about the markers the robot sees at the moment. We can check whether the robot sees a particular marker by calling update() and checking whether the desired marker is among the returned IDs.

    def update(self, image):
        # Returns the list of marker IDs (ints) visible in the image.
        self.coords = get_coords(image)
        self.marker_ids = get_marker_ids(image)
        return self.marker_ids

To get the marker IDs we use the OpenCV ArUco library, which comes with its own markers (see image below) and corresponding IDs, as well as several functions to detect them. The library also allows computing the relative pose between camera and marker.

However, we did not find the library to be extremely stable. That is why we use as little of its functionality as possible, i.e. only marker detection, including marker IDs and marker coordinates in the image.

Thus far we have hard-coded the marker ID of the box the robot should put the brick in. In the future we plan to assign IDs to bricks corresponding to the cluster they have been assigned to, but that is for achievement 8.

Adding second camera

Why do we want to add a second camera? As you may have noticed, our primary camera looks at a very steep angle towards the ground so that it can see bricks directly in front of the robot; it is also crucial that it can see the brick between the grippers to estimate the optimal gripping position. The image below shows the camera's viewing angle with a pink arrow. For farther-away objects, especially the boxes with their large markers, it is however very useful to have a camera pointing forwards so that they are in the field of view as well.

The Jetson already comes with its own built-in camera, and all the interfaces have already been created. It faces forwards, as you can see in the image (green arrow). So it is ready to go and we only have to add the logic.

The basic idea is to use the front-facing camera to detect the boxes and the rough direction in which there are bricks. Then, when we are actually looking for a specific brick, the down-facing camera is used to (1) find the brick, (2) move the robot towards the brick and (3) make sure that the brick is between the grippers so the robot can grasp it.

In practice we used the front-facing camera only to detect the boxes and move towards them. As soon as the robot is close enough and this camera loses track of the marker, we switch to the down-facing camera, which then has a perfect viewing angle to detect the marker on the box. For scanning the brick in the first phase and the environment in the second, only the down-facing camera is used.
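The hand-over logic itself is simple; here is a minimal sketch (the camera names, mode strings and marker-visibility flag are placeholders for our actual interfaces):

```python
FRONT, DOWN = "front", "down"

def choose_camera(mode, front_sees_marker):
    # While approaching a box, stay on the front camera until it loses
    # the marker, then hand over to the down-facing camera. Scanning the
    # brick and the environment always uses the down-facing camera.
    if mode == "approach_box":
        return FRONT if front_sees_marker else DOWN
    return DOWN
```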

Speeding up the Segmentation Network

Another approach we took to get the code running at a faster frame rate on the Jetson was to retrain our neural networks at smaller sizes. We were originally using two neural networks: one for image segmentation, and one for calculating the latent variables used to cluster bricks. Each of these was a 38-layer wide ResNet with around 125 million weights. These are huge networks; with them, segmenting an image and extracting the latent representation for each brick ran at around 4 frames per second. We retrained each network by taking only its first 5 ResNet layers and appending the respective output layer, initializing those five layers with the pre-trained weights from the large networks. The new networks have around 1.5 million parameters each, roughly 80 times fewer than the originals, and each runs around five times faster, so we can now get segmentations and latent representations for clustering at around 20 frames per second.

Optimizing exploration

By this we mean letting the robot not only turn 360 degrees around its axis but also drive around in order to scan the environment. We have not advanced on this so far. Several problems arise, and here is how we seek to solve them:

  1. Cables: The robot cannot move freely with cables running between it and the computer. Moving the code to the Jetson makes things a bit better, but we still have the power and network cables, which at least are long.
  2. Which brick is which? With a robot moving freely and taking images, and no way of tracking where the robot is relative to the world, you also cannot know where the bricks are. Therefore, unless the images were taken consecutively, there is no way of knowing which bricks seen from different angles are the same.
  3. Making sure the robot stays in the region where there actually are bricks. Again, without SLAM there is no way to know where the robot is.

Even though we abandoned the idea of using SLAM for mapping, we still wanted some kind of exploration. For this we use the markers: they are placed at the corners of the square the robot should move in, and the robot simply moves from marker to marker.
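A minimal sketch of that marker-to-marker patrol (the helper callables are placeholders for our driving and scanning routines):

```python
def patrol_markers(marker_ids, drive_to_marker, scan_for_brick):
    # Visit the corner markers in a fixed cycle; at each one, scan for
    # the target brick and return its position once it is found.
    while True:
        for marker_id in marker_ids:
            drive_to_marker(marker_id)
            found = scan_for_brick()
            if found is not None:
                return found
```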