Localizing a robot and mapping its surroundings is known to be non-trivial and, as you will learn by the end of this post, it was definitely a hard task for us.
Simultaneous localization and mapping (SLAM) estimates the position of a robot and a map of its environment at the same time, which results in a chicken-and-egg problem: a map is needed for good localization, and a good position estimate is needed for mapping. It became even more difficult for us because we didn't use any depth sensors such as LIDAR and relied only on an RGB camera for SLAM. Nevertheless, we tried out multiple methods. A summary of our efforts follows.
Method 1) ORB-SLAM - Monocular SLAM
In our first attempt we used ORB-SLAM2, an open-source, real-time SLAM library that computes the camera trajectory and a sparse 3D reconstruction of the environment. It comes with a visual vocabulary trained on the KITTI, TUM and EuRoC datasets.
The execution flow of the algorithm is shown below.
The method uses ORB (Oriented FAST and Rotated BRIEF) features to find and track landmarks in the image sequence.
ORB is basically a fusion of the FAST keypoint detector and the BRIEF descriptor, with many modifications to enhance performance. First, it uses FAST to find keypoints, then applies the Harris corner measure to pick the top N points among them. It also uses an image pyramid to produce multi-scale features. For rotation invariance, it computes the intensity-weighted centroid of the patch centered on the detected corner. The direction of the vector from the corner to this centroid gives the orientation. To improve rotation invariance, the moments are computed only over x and y within a circular region of radius r, where r is the size of the patch.
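The orientation step described above can be illustrated with a few lines of NumPy. This is only a minimal sketch, not the code ORB-SLAM2 or OpenCV actually run: the function name `patch_orientation` and the choice of half the patch width as the radius are our own assumptions for illustration.

```python
import numpy as np

def patch_orientation(patch):
    """Intensity-centroid orientation (radians) of a square grayscale patch.

    Hypothetical helper for illustration only; the keypoint is assumed to sit
    at the patch center, and image y grows downwards.
    """
    patch = np.asarray(patch, dtype=np.float64)
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0

    # Pixel coordinates relative to the patch center.
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy

    # Only pixels inside a circular region contribute to the moments.
    radius = min(cx, cy)
    mask = dx**2 + dy**2 <= radius**2

    # First-order image moments m10 and m01.
    m10 = np.sum(dx[mask] * patch[mask])
    m01 = np.sum(dy[mask] * patch[mask])

    # Direction of the vector from the corner (center) to the centroid.
    return np.arctan2(m01, m10)

# A patch that is brighter on its right-hand side points along +x.
demo = np.zeros((31, 31))
demo[:, 20:] = 255
print(patch_orientation(demo))
```

Because the angle is derived from the patch content itself, rotating the image rotates the angle with it, which is what lets the descriptor be computed in a rotation-normalized frame.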
One downside we discovered was that the features were quite unstable (bad camera, noisy images, very dynamic environment), so the system often lost track and could not always recover reliably. Also, systems like ORB-SLAM or LSD-SLAM tend to fail under pure rotation without translation, and we have a differential drive robot, which really likes to turn on the spot. Of course, our camera is not exactly at the center of rotation, but it is close enough to make it hard to estimate our pose reliably.
Method 2) Deep Neural Network
In the next step we tried a pre-trained model from "Deeper Depth Prediction with Fully Convolutional Residual Networks" [3] (FCRN-DepthPrediction [4]) to predict the depth of each pixel from an RGB image. Below is the architecture of the network.
The network uses fully convolutional layers with residual connections to predict a depth value for every pixel of the input image.
The NYU Depth v2 dataset that the network was pre-trained on consists of images from 464 different scenes, and the network was trained on about 95k pairs of RGB-D images derived from roughly 12k unique images of these scenes. The network therefore works well in indoor environments. We fed the predicted depth into ORB-SLAM's RGB-D mode to estimate the map (3D reconstruction) and the pose. This gave us stable features for pose estimation, and the depth could also help us avoid obstacles. The image below shows the information flow in the network.
But all these pros came at a cost: the network turned out to be too big and too slow. Given that we were already running a big segmentation network on the Jetson, adding another network on the same platform was not feasible.
Method 3) ARUCO marker-based SLAM (with known marker location)
Since our challenge guidelines give a pretty clear specification of the minimum capabilities of our robot, we decided to also design a localization system that makes full use of these details. Since our robot will live in a 1.5 m x 1.5 m workspace, we decided to place boxes with ARUCO markers on them as waypoints/outer corners of the workspace.
With precise knowledge of the location of each marker, we can define a basic map up front and skip the mapping part, reducing our problem to pure localization (no longer SLAM!).
These 8 QR-like markers can be used to compute the relative distance between the camera and each visible marker, and a particle filter can then estimate the overall pose within that map. An example of the setup is shown below.
However, there are some pitfalls that made this approach less robust than originally assumed. Many small inaccuracies (marker placement, slight rotation of the marker plane, very noisy pose estimation, bad camera image quality that required lots of postprocessing, subpixel approximation, etc.) quickly add up, making it difficult to get a consistent pose estimate over a longer time.
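To make the particle-filter idea from this section more concrete, here is a rough sketch of range-based localization against known marker positions. It is not our actual implementation: the marker layout in `MARKERS`, the noise levels, the function names, and the simplification to a stationary robot with a 2D position (no heading) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical layout: 8 markers on the corners and edge midpoints
# of a 1.5 m x 1.5 m workspace (coordinates in meters).
MARKERS = np.array([[0, 0], [0.75, 0], [1.5, 0], [1.5, 0.75],
                    [1.5, 1.5], [0.75, 1.5], [0, 1.5], [0, 0.75]], dtype=float)

def measurement_update(particles, weights, marker_ids, ranges, sigma=0.05):
    """Reweight particles by how well they explain the measured distances."""
    for marker_id, measured in zip(marker_ids, ranges):
        expected = np.linalg.norm(particles - MARKERS[marker_id], axis=1)
        weights = weights * np.exp(-0.5 * ((expected - measured) / sigma) ** 2)
    weights = weights + 1e-300  # guard against an all-zero weight vector
    return weights / weights.sum()

def resample(particles, weights, rng):
    """Draw a new particle set with probability proportional to the weights."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

def localize(true_pos, n_particles=2000, n_steps=10, seed=0):
    """Estimate a stationary 2D position from simulated noisy marker ranges."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0.0, 1.5, size=(n_particles, 2))
    for _ in range(n_steps):
        # Simulate seeing 3 of the 8 markers with noisy distance readings.
        visible = rng.choice(len(MARKERS), size=3, replace=False)
        ranges = np.linalg.norm(MARKERS[visible] - true_pos, axis=1)
        ranges = ranges + rng.normal(0.0, 0.02, size=3)

        weights = np.full(n_particles, 1.0 / n_particles)
        weights = measurement_update(particles, weights, visible, ranges)
        particles = resample(particles, weights, rng)
        # Small jitter keeps the particle set from collapsing to one point.
        particles = particles + rng.normal(0.0, 0.01, size=particles.shape)
    return particles.mean(axis=0)

print(localize(np.array([0.5, 0.9])))
```

In this toy setup the estimate settles close to the true position, but it also hints at the pitfalls above: any bias in `MARKERS` (a misplaced box) or in the measured ranges (bad calibration) shifts the estimate directly, and the filter has no way to notice.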
Method 4) ARUCO marker-based, graph-SLAM (with unknown marker location)
With the problems from 3) and the advantages from 1) in mind, we quickly coded up a graph-based SLAM that uses the markers as its only features. For a good overview, we suggest this reading by Thrun and Montemerlo (or, if you have access to it, Probabilistic Robotics is a great resource). A nice and gentle introduction is also the ICRA workshop presentation by Stachniss.
The general idea of graph-SLAM is to represent poses and landmarks (typically sparse features like SIFT/SURF, or in our case the corners of our markers) as nodes, while edges represent the spatial constraints between them that are collected during mapping. Typically, these constraints are quite uncertain and change over time. Like all SLAM systems, we have to deal with noisy measurements, drift, etc., which means that eventually we will measure the same landmark at different locations or estimate a wrong camera pose.
Once the graph has been built, we can perform bundle adjustment. Basically, we want to find the node configuration that minimizes the overall error of all constraints across the graph (fancy terms, but it's simply a non-linear least squares estimation problem).
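To show what "minimizing the overall error of all constraints" looks like in practice, here is a deliberately tiny sketch: a 1D graph optimized with Gauss-Newton. It is not our marker-based implementation. The function `optimize_graph`, the edge format, and the prior on node 0 are assumptions for illustration, and because the 1D problem is linear it converges after a single step, whereas a real graph with 6-DoF camera poses and reprojection errors is non-linear and needs several iterations.

```python
import numpy as np

def optimize_graph(n_nodes, edges, n_iters=5, prior_info=1.0):
    """Gauss-Newton on a toy 1D graph.

    edges: list of (i, j, z, info), meaning "node j was measured at offset z
    from node i" with information (inverse variance) info.
    """
    x = np.zeros(n_nodes)
    for _ in range(n_iters):
        H = np.zeros((n_nodes, n_nodes))  # approximate Hessian (J^T * info * J)
        b = np.zeros(n_nodes)             # gradient term (J^T * info * e)
        for i, j, z, info in edges:
            e = (x[j] - x[i]) - z         # residual of this constraint
            J = np.zeros(n_nodes)
            J[i], J[j] = -1.0, 1.0        # derivative of the residual w.r.t. x
            H += info * np.outer(J, J)
            b += info * J * e
        # Prior "node 0 sits at 0" removes the global-shift ambiguity.
        H[0, 0] += prior_info
        b[0] += prior_info * x[0]
        x = x + np.linalg.solve(H, -b)
    return x

# Two odometry steps of 1.0 each, plus a "loop closure" that claims node 2
# is 2.3 away from node 0. The 0.3 disagreement gets spread over all edges.
edges = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (0, 2, 2.3, 1.0)]
print(optimize_graph(3, edges))  # approximately [0.0, 1.1, 2.2]
```

The same structure carries over to the full problem: each marker observation adds one edge, and the solver distributes the inconsistencies according to how much each measurement is trusted.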
We were able to get some decent results even before bundle adjustment, but the system would still lose track after some time or after jerky movements. To integrate bundle adjustment we had two options: online bundle adjustment, or a single bundle adjustment after mapping the whole environment, followed by localization only. Our graph is very sparse: we have only 4 corners per marker, 8 markers in total, and typically only 2 or 3 markers per view/frame. Compared to a sparse graph-SLAM using 2000 SIFT features per frame, and maybe IMU measurements or other modalities, we can perform a full online bundle adjustment near-instantaneously.
However, we ended up not fully completing this part of the project. With the last days of the competition approaching rapidly, we decided to commit to a feature freeze several days before the final competition day: better a suboptimal but working system than a high risk of total failure.
Also, we came up with a smart system to interact with our workspace without a known map (just based on markers and one constraint on their placement) that can still cope with all sorting problems. Path planning won't be as efficient, but given the scope of the project this is more than justified.
While we opted out of SLAM, we were still able to gather some interesting insights into applying SLAM to our problem. To summarize some of our findings:
ORB-SLAM - Monocular SLAM
- Pros - This method is purely based on RGB values, so it is comparatively fast.
- Cons - Due to the low-resolution camera, the landmark points used by the algorithm are unstable. With changing light conditions and camera rotation, the landmarks are lost frequently.
Deep Neural Network
- Pros - Prediction of depth is also useful for obstacle avoidance. Better performance than the previous method.
- Cons - The network is big and the prediction is very slow.
ARUCO marker-based SLAM (with known marker location)
- Pros - Easy to use and fast enough for real time.
- Cons - Depends on many factors (camera calibration, marker positions and orientations, etc.) that must all be precise for this method to work properly.
ARUCO marker-based, graph-SLAM (with unknown marker location)
- Pros - Reasonably easy to implement. Very fast BA due to the constrained problem complexity.
- Cons - Requires good camera calibration. Needs regular bundle adjustment and loop closure to be robust (otherwise switch to localization only).
That's all for now, but stay tuned for more blog posts!
Yours truly - Great Dolphins Kaboom