End-to-End Egospheric Spatial Memory
ICLR 2021

Dyson Robotics Lab, Department of Computing
Imperial College London


Egospheric Spatial Memory (ESM) encodes memory in an ego-sphere around the agent, enabling expressive 3D representations. ESM can be trained end-to-end via either imitation or reinforcement learning, and improves both training efficiency and final performance over other memory baselines on drone and manipulator visuomotor control tasks.

ESM is a parameter-free module that relies on forward-warp reprojections to update the memory. The egospheric memory at time t-1 is reprojected into the agent's frame at time t and combined with the new observations at time t to produce the new egospheric memory.
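The update described above can be illustrated with a minimal sketch. It assumes an equirectangular (azimuth, elevation) grid, a "nearest point wins" rule when several points land in the same cell, and a known relative pose between consecutive agent frames. All function names are hypothetical, and the loops are written in plain Python for readability rather than speed; this is not the authors' implementation.

```python
import math

def direction_to_pixel(x, y, z, height, width):
    """Map a 3D point in the agent frame to (row, col, radius) on the ego-sphere grid."""
    radius = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)                               # in [-pi, pi]
    elevation = math.asin(max(-1.0, min(1.0, z / radius)))   # in [-pi/2, pi/2]
    col = int((azimuth + math.pi) / (2 * math.pi) * width) % width
    row = min(int((elevation + math.pi / 2) / math.pi * height), height - 1)
    return row, col, radius

def pixel_to_point(row, col, radius, height, width):
    """Back-project the centre of an ego-sphere cell to a 3D point at the stored radius."""
    azimuth = (col + 0.5) / width * 2 * math.pi - math.pi
    elevation = (row + 0.5) / height * math.pi - math.pi / 2
    return (radius * math.cos(elevation) * math.cos(azimuth),
            radius * math.cos(elevation) * math.sin(azimuth),
            radius * math.sin(elevation))

def empty_memory(height, width):
    """Each cell holds (radius, feature), or None if nothing has been observed there."""
    return [[None] * width for _ in range(height)]

def splat(memory, point, feature):
    """Forward-warp one point into the grid; the nearest point per cell wins (assumed rule)."""
    height, width = len(memory), len(memory[0])
    if math.sqrt(sum(c * c for c in point)) < 1e-9:
        return  # a point at the sphere centre has no direction
    row, col, radius = direction_to_pixel(*point, height, width)
    cell = memory[row][col]
    if cell is None or radius < cell[0]:
        memory[row][col] = (radius, feature)

def update_memory(prev_memory, rotation, translation, observations):
    """Warp the memory from time t-1 into the frame at time t, then fuse new points.

    rotation (3x3 nested lists) and translation (3-tuple) map points from the
    previous agent frame to the current one. observations is a list of
    (point, feature) pairs already expressed in the current agent frame.
    """
    height, width = len(prev_memory), len(prev_memory[0])
    new_memory = empty_memory(height, width)
    for row in range(height):
        for col in range(width):
            cell = prev_memory[row][col]
            if cell is None:
                continue
            p = pixel_to_point(row, col, cell[0], height, width)
            moved = tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                          for i in range(3))
            splat(new_memory, moved, cell[1])
    for point, feature in observations:
        splat(new_memory, point, feature)
    return new_memory
```

Nothing in this sketch is learnt, which matches the parameter-free description. Stored points are re-quantized to cell centres on every update, so a practical implementation would need to choose the grid resolution with that in mind.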

Off-the-Shelf Mapping

Assuming access to a stream of depth and color images, and access to camera pose estimates, the ESM module can be used for off-the-shelf real-time egocentric mapping, with color values projected into memory.
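As a rough illustration of how a depth and color frame could be turned into points for such a map, the sketch below back-projects each pixel with a standard pinhole camera model and moves it into the agent frame using the camera pose. The function name, the intrinsics parameters (fx, fy, cx, cy), and the convention that non-positive depth means "no measurement" are assumptions made for this example.

```python
def backproject_depth(depth, color, fx, fy, cx, cy, cam_rotation, cam_translation):
    """Turn a depth + color image into (point, color) pairs in the agent frame.

    depth and color are nested lists indexed [v][u]. cam_rotation (3x3 nested
    lists) and cam_translation (3-tuple) map camera-frame points to the agent frame.
    """
    points = []
    for v, (depth_row, color_row) in enumerate(zip(depth, color)):
        for u, (d, rgb) in enumerate(zip(depth_row, color_row)):
            if d <= 0:  # assumed convention: non-positive depth is a missing reading
                continue
            # Pinhole model: pixel (u, v) at depth d, expressed in the camera frame.
            p_cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            p_agent = tuple(
                sum(cam_rotation[i][j] * p_cam[j] for j in range(3)) + cam_translation[i]
                for i in range(3)
            )
            points.append((p_agent, rgb))
    return points
```

Each returned pair carries a 3D position and the color value to be written into memory at the corresponding direction on the ego-sphere.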

Neural Network Integration

The real strength of ESM arises when training end-to-end as part of a wider neural network. The ESM module can be combined with both pre-module convolutions and post-module convolutions to solve a variety of downstream tasks. The pre-module convolutions enable learnt features, optimized for the downstream task, to be stored in the module. The post-module convolutions can then use this stored representation to execute the task. We refer to networks with both pre- and post-module convolutions as Egospheric Spatial Memory Networks (ESMN). For some tasks, the post-module convolutions alone are sufficient, with color values projected directly into the memory. We refer to these networks as ESMN-RGB.
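The data flow of the two variants can be sketched as a composition of three stages. The networks and the memory update are passed in as plain callables, so the sketch is independent of any deep learning framework; the function names and argument order are hypothetical.

```python
def esmn_step(image, depth, pose, memory, pre_net, memory_update, post_net):
    """One ESMN time step: learnt features are written to memory, then read out."""
    features = pre_net(image)                                # pre-module convolutions
    memory = memory_update(memory, features, depth, pose)    # parameter-free ESM update
    output = post_net(memory)                                # post-module convolutions
    return output, memory

def esmn_rgb_step(image, depth, pose, memory, memory_update, post_net):
    """ESMN-RGB: the same pipeline with no pre-module network, so raw color is stored."""
    return esmn_step(image, depth, pose, memory, lambda rgb: rgb, memory_update, post_net)
```

Because the memory update has no parameters, gradients from the task loss reach the pre-module network only through the values it writes into memory.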

We compare against other, less structured memory baselines, such as long short-term memory (LSTM) and Neural Turing Machines (NTM). The baseline methods are given access to all the same information, including ground-truth poses.

Image to Action Learning

We test ESM on a variety of image-to-action reacher tasks, covering 6-DoF control of both drones and robot manipulators, with either onboard or freely moving cameras, with networks conditioned on target shape or target color, and trained using either imitation learning or reinforcement learning. In all cases, ESMN and ESMN-RGB outperform the less structured memory baselines (LSTM and NTM).

The explicit geometry also enables seamless integration with other non-learnt control strategies, such as local obstacle avoidance. The egocentric geometry leads to more robust avoidance than is possible using individual depth frames.
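To illustrate how a non-learnt controller might consume the egospheric geometry, the sketch below computes a simple repulsive velocity from a grid of stored radii. The potential-field style rule, the equirectangular grid layout, and the names avoidance_velocity and safety_radius are assumptions chosen for this example; the page does not specify which avoidance strategy was used.

```python
import math

def avoidance_velocity(radii, safety_radius):
    """Sum repulsive pushes away from every ego-sphere cell closer than safety_radius.

    radii is a nested list indexed [row][col] holding the distance to the nearest
    surface in that direction, or None where nothing has been observed.
    """
    height, width = len(radii), len(radii[0])
    push = [0.0, 0.0, 0.0]
    for row in range(height):
        elevation = (row + 0.5) / height * math.pi - math.pi / 2
        for col in range(width):
            r = radii[row][col]
            if r is None or r >= safety_radius:
                continue
            azimuth = (col + 0.5) / width * 2 * math.pi - math.pi
            direction = (math.cos(elevation) * math.cos(azimuth),
                         math.cos(elevation) * math.sin(azimuth),
                         math.sin(elevation))
            # Closer obstacles push harder; the push points away from the obstacle.
            weight = (safety_radius - r) / safety_radius
            for i in range(3):
                push[i] -= weight * direction[i]
    return tuple(push)
```

Because the grid covers all directions around the agent, obstacles that have left the current camera view still contribute, which a single depth frame cannot provide.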

Object Segmentation

We also apply the module to object segmentation. We find that both the pre-module and post-module convolutions of the ESMN architecture are required to achieve the best performance. ESMN combines the benefits of both image-level and map-level inference, outperforming baselines that either fuse monocular predictions (Mono) or make predictions directly from an RGB map (ESMN-RGB).