Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance


Anish Bhattacharya*,  Nishanth Rao*,  Dhruv Parikh*,  Pratik Kunapuli,  Yuwei Wu,  Yuezhan Tao,

Nikolai Matni,  Vijay Kumar

GRASP, University of Pennsylvania

As end-to-end, learning-based approaches to fast, vision-based quadrotor flight become more prevalent, we study the capabilities of a vision transformer (ViT) architecture that maps depth images to velocity commands, in both simulation and the real world. We find that when combined with LSTM layers, a ViT outperforms other state-of-the-art architectures (UNet, LSTM-only) as forward velocity increases, and it outperforms modular approaches by scaling to a range of speeds while maintaining low collision rates. We demonstrate zero-shot sim-to-real transfer with high-speed, multi-obstacle avoidance at speeds reaching 7 m/s.
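The architecture described above can be sketched roughly as follows: depth images are patch-embedded and passed through a Transformer encoder (the ViT), and per-frame features are fed to LSTM layers that produce velocity commands. This is a minimal illustrative sketch in PyTorch; all layer sizes, the token pooling, and the 3-axis velocity head are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ViTLSTMPolicy(nn.Module):
    """Hypothetical ViT+LSTM sketch: depth image sequence -> velocity commands."""

    def __init__(self, img_size=64, patch=8, dim=64, depth=2, heads=4, lstm_hidden=64):
        super().__init__()
        # Patch embedding: a strided conv splits the depth image into tokens.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # LSTM over the frame sequence captures temporal context.
        self.lstm = nn.LSTM(dim, lstm_hidden, batch_first=True)
        self.head = nn.Linear(lstm_hidden, 3)  # assumed (vx, vy, vz) command

    def forward(self, depth_seq, state=None):
        # depth_seq: (B, T, 1, H, W) sequence of depth frames.
        B, T = depth_seq.shape[:2]
        x = depth_seq.flatten(0, 1)                               # (B*T, 1, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B*T, N, dim)
        tokens = self.encoder(tokens + self.pos)
        feat = tokens.mean(dim=1).view(B, T, -1)                  # mean-pool tokens per frame
        out, state = self.lstm(feat, state)                       # (B, T, lstm_hidden)
        return self.head(out), state                              # (B, T, 3) velocities
```

Keeping the LSTM hidden state across calls lets the policy run frame-by-frame at deployment time, which is one plausible reason a recurrent head helps at higher forward velocities.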

Our ViT+LSTM model generalizes well to various simulation environments

Our ViT+LSTM model transfers zero-shot to the real world

4 m/s

7 m/s

BibTeX

@inproceedings{bhattacharya2025vision,
  title={Vision transformers for end-to-end vision-based quadrotor obstacle avoidance},
  author={Bhattacharya, Anish and Rao, Nishanth and Parikh, Dhruv and Kunapuli, Pratik and Wu, Yuwei and Tao, Yuezhan and Matni, Nikolai and Kumar, Vijay},
  booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
  year={2025},
  organization={IEEE}
}