Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance
Anish Bhattacharya*, Nishanth Rao*, Dhruv Parikh*, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar
As end-to-end learning-based approaches for fast, vision-based quadrotor flight become more prevalent, we study the capabilities of a vision transformer (ViT) architecture that maps depth images to velocity commands, both in simulation and in the real world. We find that, when combined with LSTM layers, a ViT outperforms other state-of-the-art architectures of choice (UNet, LSTM-only) as forward velocity increases, and also outperforms modular approaches, scaling across speeds while maintaining low collision rates. We demonstrate zero-shot sim-to-real transfer with high-speed, multi-obstacle avoidance at speeds reaching 7 m/s.
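To make the "depth image in, tokens out" front end of a ViT concrete, here is a minimal sketch of generic ViT-style patch tokenization in plain Python. It is an illustration only: the function name `patchify`, the 4x4 image, and the patch size of 2 are hypothetical, and non-overlapping patches are assumed for simplicity, which may differ from the tokenization the paper's model actually uses. In a full model, each token would be linearly projected, passed through transformer layers, and then fed to LSTM layers that output a velocity command.

```python
from typing import List


def patchify(depth: List[List[float]], patch: int) -> List[List[float]]:
    """Split an H x W depth image into non-overlapping patch x patch tiles.

    Each tile is flattened row-major into one token vector. This is the
    generic ViT tokenization step, assumed here for illustration.
    """
    height, width = len(depth), len(depth[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                depth[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Hypothetical 4x4 "depth image" with values 0..15, split into 2x2 patches.
depth = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = patchify(depth, patch=2)
print(len(tokens), tokens[0])
```

With these toy inputs the image yields four tokens of four values each; a real depth image would produce many more tokens, one per patch.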
Our ViT+LSTM model generalizes well to various simulation environments
Our ViT+LSTM model transfers zero-shot to the real world
4 m/s
7 m/s
@inproceedings{bhattacharya2025vision,
title={Vision transformers for end-to-end vision-based quadrotor obstacle avoidance},
author={Bhattacharya, Anish and Rao, Nishanth and Parikh, Dhruv and Kunapuli, Pratik and Wu, Yuwei and Tao, Yuezhan and Matni, Nikolai and Kumar, Vijay},
booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
year={2025},
organization={IEEE}
}