Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance
Anish Bhattacharya*, Nishanth Rao*, Dhruv Parikh*, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar
GRASP, University of Pennsylvania
More updates to come!
As end-to-end learning-based approaches to fast, vision-based quadrotor flight become more prevalent, we study a vision transformer (ViT) architecture that maps depth images to velocity commands, both in simulation and in the real world. We find that when combined with LSTM layers, the ViT outperforms other state-of-the-art architectures (UNet, LSTM-only) as forward velocity increases. We also find that this model transfers zero-shot to high-speed obstacle dodging in real-world indoor scenes.
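To make the architecture concrete, below is a minimal PyTorch sketch of the kind of ViT+LSTM policy we study: a transformer encoder embeds each depth image into a single pooled token, and an LSTM integrates those tokens over time before regressing a velocity command. All hyperparameters here (image size, patch size, embedding width, depth, and the assumed 3D velocity output) are illustrative placeholders, not the exact configuration of our model.

```python
import torch
import torch.nn as nn

class ViTLSTMPolicy(nn.Module):
    """Illustrative ViT+LSTM policy: depth image sequence -> velocity commands."""

    def __init__(self, img_size=60, patch_size=10, embed_dim=128,
                 num_heads=4, num_layers=3, lstm_hidden=128):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patchify each single-channel depth image with a strided convolution.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learned positional embedding, one vector per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=num_layers)
        # The LSTM provides temporal memory across consecutive frames.
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True)
        # Regress a velocity command per frame (3D is an assumption here).
        self.head = nn.Linear(lstm_hidden, 3)

    def forward(self, depth_seq, hidden=None):
        # depth_seq: (batch, time, 1, H, W) sequence of depth images.
        b, t = depth_seq.shape[:2]
        x = depth_seq.flatten(0, 1)                          # (b*t, 1, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (b*t, patches, dim)
        x = self.encoder(x + self.pos_embed)
        x = x.mean(dim=1).view(b, t, -1)                     # one token per frame
        x, hidden = self.lstm(x, hidden)
        return self.head(x), hidden                          # (b, t, 3) velocities

# Example: an 8-frame sequence of 60x60 depth images -> per-frame velocities.
model = ViTLSTMPolicy()
vel, _ = model(torch.rand(2, 8, 1, 60, 60))   # vel.shape == (2, 8, 3)
```

The recurrence is the key design choice: the LSTM gives the otherwise feedforward ViT memory over past frames, which is what lets the combined model keep outperforming the baselines as forward velocity increases.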
The ViT+LSTM model generalizes to various simulation environments
The ViT+LSTM model transfers zero-shot to the real world
(real speed shown)