Monocular Event-Based Vision for Obstacle Avoidance with a Quadrotor
Anish Bhattacharya, Marco Cannici, Nishanth Rao, Yuezhan Tao,
Vijay Kumar, Nikolai Matni, Davide Scaramuzza
GRASP, University of Pennsylvania
Robotics and Perception Group, University of Zurich
Theme: Bridging the sim-to-real gap of event-based perception by leveraging cross-modal vision post-training
Abstract
We present the first static-obstacle avoidance method for quadrotors using just an onboard, monocular event camera. Quadrotors are capable of fast and agile flight in cluttered environments when piloted manually, but vision-based autonomous flight in unknown environments is difficult in part due to the sensor limitations of traditional onboard cameras. Event cameras, by contrast, promise nearly zero motion blur and high dynamic range, but produce a large volume of events under significant ego-motion and lack a continuous-time sensor model in simulation, making direct sim-to-real transfer impossible.
By leveraging depth prediction as a pretext task in our learning framework, we pre-train a reactive events-to-control obstacle avoidance policy on approximated, simulated events, then fine-tune the perception component with a small amount of real-world events-and-depth data to achieve obstacle avoidance in both indoor and outdoor settings.
We demonstrate this across two quadrotor-event-camera platforms in multiple settings and find, contrary to traditional vision-based work, that low speeds (1 m/s) make the task harder and more prone to collisions, while high speeds (5 m/s) yield better event-based depth estimation and avoidance. We also find that success rates in outdoor scenes can be significantly higher than in certain indoor scenes.
Why?
Event cameras convey a near-continuous stream of perception data with microsecond latency and high dynamic range, potentially allowing robots to interpret their world at faster speeds and in variable lighting, including in near-dark scenes where traditional cameras fail.
A typical simulation training scheme may run reinforcement learning or imitate a privileged expert to produce actions from vision input. In the event-vision case, however, we do not have a continuous-time event camera model available in simulation, which makes direct sim-to-real transfer impossible.
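For intuition, here is a minimal sketch of the standard contrast-threshold event model that video-to-events approximators emulate: a pixel fires an event whenever its log intensity changes by more than a contrast threshold C since the last event at that pixel. The function name and threshold value are illustrative, and real generators additionally upsample the video to high frame rates and interpolate per-event timestamps, which this sketch omits.

```python
import numpy as np

def events_from_frames(log_frames, timestamps, C=0.2):
    """Approximate an event stream from video by thresholding per-pixel
    log-intensity changes (illustrative sketch; real generators upsample
    frames in time and interpolate event timestamps).

    log_frames: (T, H, W) log-intensity frames
    timestamps: (T,) frame times in seconds
    C:          contrast threshold (illustrative value)
    """
    events = []                     # tuples (t, x, y, polarity)
    ref = log_frames[0].copy()      # per-pixel reference log intensity
    for t, frame in zip(timestamps[1:], log_frames[1:]):
        diff = frame - ref
        ys, xs = np.nonzero(np.abs(diff) >= C)  # pixels that fire
        for x, y in zip(xs, ys):
            polarity = 1 if diff[y, x] > 0 else -1
            events.append((t, int(x), int(y), polarity))
            ref[y, x] = frame[y, x]  # reset reference where an event fired
    return events
```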
Method
To alleviate this and enable simulation pre-training of events-to-control policies, we gather photorealistic ego-video from hundreds of privileged-expert trials flown in a simulated forest environment. We then compute approximate event streams from each rollout with a learned events approximator (Vid2E) and time-synchronize event batches with the gathered depth images. We train a perception network D(θ) to predict depth images from a binary event-mask representation of the event stream, jointly with an avoidance network V(Φ) that predicts the expert's velocity commands from the depth predictions. To transfer to the real world, we collect time-synchronized, calibrated events-and-depth data and fine-tune D(θ) on a few examples. With this fine-tuned perception network and the simulation-trained avoidance network, we deploy on a robot for obstacle avoidance.
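To make the training setup concrete, below is a minimal PyTorch sketch of the joint objective: the event batch is collapsed into a binary mask, D(θ) regresses depth from that mask, V(Φ) regresses the expert's velocity command from the predicted depth, and both losses are optimized together. The architectures, loss weighting, and names (binary_event_mask, PerceptionD, AvoidanceV, joint_step) are illustrative stand-ins under our stated assumptions, not the paper's actual networks.

```python
import torch
import torch.nn as nn

def binary_event_mask(events, H, W):
    """Collapse a batch of (t, x, y, polarity) events into a binary mask:
    a pixel is 1 if any event fired there in the window (polarity ignored).
    Stack masks into (B, 1, H, W) batches before training."""
    mask = torch.zeros(1, H, W)
    for _, x, y, _ in events:
        mask[0, y, x] = 1.0
    return mask

class PerceptionD(nn.Module):
    """Illustrative stand-in for the perception network D(theta):
    binary event mask -> dense depth image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, mask):
        return self.net(mask)

class AvoidanceV(nn.Module):
    """Illustrative stand-in for the avoidance network V(Phi):
    predicted depth -> velocity command (vx, vy, vz)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(64, 3),
        )
    def forward(self, depth):
        return self.net(depth)

D, V = PerceptionD(), AvoidanceV()
opt = torch.optim.Adam([*D.parameters(), *V.parameters()], lr=1e-4)

def joint_step(mask, depth_gt, vel_expert, w=1.0):
    """One joint pre-training step: depth supervision (the pretext task)
    plus imitation of the privileged expert's velocity command."""
    depth_pred = D(mask)                 # (B, 1, H, W)
    vel_pred = V(depth_pred)             # (B, 3)
    loss = nn.functional.l1_loss(depth_pred, depth_gt) \
         + w * nn.functional.mse_loss(vel_pred, vel_expert)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

At real-world fine-tuning time, only D(θ)'s parameters would be updated, using the depth loss alone on the small set of calibrated real events-and-depth pairs.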
Key findings
We find that joint training may allow V(Φ) to better predict avoidance commands from the not-necessarily-geometrically-consistent depth predictions of D(θ). We also find that, contrary to prior traditional vision-based navigation work, success rates in real-world trials improve drastically as forward velocity increases. This result held across multiple quadrotor platforms with different event cameras, different event-camera biases, and various environments.
We successfully avoid obstacles at speeds of 1-5 m/s, indoors and outdoors, in the dark and in the forest.
@inproceedings{bhattacharya2024monocular,
  title={Monocular Event-Based Vision for Obstacle Avoidance with a Quadrotor},
  author={Bhattacharya, Anish and Cannici, Marco and Rao, Nishanth and Tao, Yuezhan and Kumar, Vijay and Matni, Nikolai and Scaramuzza, Davide},
  booktitle={8th Annual Conference on Robot Learning},
  year={2024}
}