Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Anish Bhattacharya*, Nishanth Rao*, Dhruv Parikh*, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar

GRASP, University of Pennsylvania

More updates to come!

As end-to-end learning-based approaches for fast, vision-based quadrotor flight become more prevalent, we study the capabilities of a vision transformer (ViT) architecture that maps depth images to velocity commands, both in simulation and in the real world. We find that when combined with LSTM layers, a ViT outperforms other state-of-the-art architectures (UNet, LSTM-only) as forward velocity increases. We also find that this model transfers zero-shot to high-speed real-world dodging in indoor scenes.
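As a rough illustration of the idea (not the authors' exact implementation), the sketch below wires a small ViT encoder over single-channel depth images into an LSTM that regresses velocity commands. The patch size, embedding width, network depth, and 3-D command output are all illustrative assumptions, not values from the paper.

```python
# Hedged sketch of a ViT+LSTM depth-to-velocity policy (illustrative only).
import torch
import torch.nn as nn


class ViTLSTMPolicy(nn.Module):
    def __init__(self, img_size=64, patch_size=8, embed_dim=128,
                 depth=4, num_heads=4, lstm_hidden=128, cmd_dim=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the depth image into patches and project them.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # LSTM aggregates per-frame ViT features over time.
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True)
        self.head = nn.Linear(lstm_hidden, cmd_dim)  # velocity command output

    def forward(self, depth_seq, hidden=None):
        # depth_seq: (batch, time, 1, H, W) sequence of depth images
        b, t = depth_seq.shape[:2]
        x = depth_seq.flatten(0, 1)                          # (b*t, 1, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (b*t, patches, dim)
        x = self.encoder(x + self.pos_embed).mean(dim=1)     # per-frame feature
        feats, hidden = self.lstm(x.view(b, t, -1), hidden)
        return self.head(feats), hidden                      # (b, t, cmd_dim)


if __name__ == "__main__":
    model = ViTLSTMPolicy()
    cmds, _ = model(torch.randn(2, 5, 1, 64, 64))  # dummy depth-image sequence
    print(cmds.shape)  # torch.Size([2, 5, 3])
```

At deployment time such a policy would be run frame-by-frame, carrying the LSTM hidden state forward between depth images so the recurrent memory persists across the flight.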

Code

Paper

Data

The ViT+LSTM model generalizes to various simulation environments

(large gifs may take time to load 🙂)

The ViT+LSTM model transfers zero-shot to real-world

(real speed shown)
