Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Anish Bhattacharya*, Nishanth Rao*, Dhruv Parikh*, Pratik Kunapuli, Nikolai Matni, Vijay Kumar

GRASP, University of Pennsylvania

As end-to-end learning-based approaches for fast, vision-based quadrotor flight become more prevalent, we study the capabilities of a vision transformer (ViT) architecture that maps depth images to velocity commands, both in simulation and in the real world. We find that, when combined with LSTM layers, a ViT outperforms other state-of-the-art architectures of choice (UNet, LSTM-only) as forward velocity increases. We also find that this model transfers zero-shot to high-speed real-world dodging in indoor scenes.
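To make the "depth image in, tokens out" idea concrete, here is a minimal sketch of the generic first step of any ViT: cutting the image into non-overlapping patches that are then treated as a token sequence. The function name `patchify`, the 4 x 6 image, and the 2 x 2 patch size are illustrative assumptions, not values from the paper; the learned embedding, attention layers, LSTM, and velocity head are omitted.

```python
def patchify(depth, patch):
    """Split an H x W depth image (list of rows) into flattened,
    non-overlapping patch x patch tokens, in row-major patch order."""
    h, w = len(depth), len(depth[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch into a single token vector.
            tokens.append([depth[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Hypothetical 4 x 6 depth image; the values are placeholders, not real data.
depth = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
tokens = patchify(depth, 2)
print(len(tokens), len(tokens[0]))  # 6 4
```

In a full model each token would be linearly embedded and passed through transformer layers; per the abstract, the resulting features are combined with LSTM layers to produce velocity commands.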





The ViT+LSTM model generalizes to various simulation environments

The ViT+LSTM model transfers zero-shot to the real world