Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Anish Bhattacharya*, Nishanth Rao*, Dhruv Parikh*, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar

GRASP, University of Pennsylvania

As end-to-end learning-based approaches for fast, vision-based quadrotor flight become more prevalent, we study the capabilities of a vision transformer (ViT) architecture that maps depth images to velocity commands in both simulation and the real world. We find that, when combined with LSTM layers, a ViT outperforms other state-of-the-art architectures of choice (UNet, LSTM-only) as forward velocity increases. It also outperforms modular approaches, scaling better across speeds while keeping collision rates low. We demonstrate zero-shot sim-to-real transfer with high-speed, multi-obstacle avoidance at speeds reaching 7 m/s.
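To make the "depth image in, tokens for the transformer" step concrete, the following is a minimal sketch of standard non-overlapping ViT patch tokenization in plain Python. It illustrates the general technique only and is not necessarily the patching scheme, image size, or patch size used in this work; the function name `patchify` and the 4 x 6 toy image are assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]


def patchify(depth: Image, patch: int) -> List[List[float]]:
    """Split an H x W depth image into non-overlapping patch x patch tokens.

    Tokens are returned in row-major patch order, and each token is the
    flattened (row-major) list of depth values inside that patch.
    """
    height = len(depth)
    width = len(depth[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")

    tokens: List[List[float]] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                depth[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            tokens.append(token)
    return tokens


# Hypothetical 4 x 6 "depth image" whose pixel values are just their flat index.
depth_image = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
tokens = patchify(depth_image, patch=2)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

In a full model, each token would typically be linearly projected to an embedding, passed through transformer encoder blocks, and the resulting features fed to recurrent layers that output a velocity command; those stages are omitted here.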

Our ViT+LSTM model generalizes well to various simulation environments

Our ViT+LSTM model transfers zero-shot to the real world

4 m/s

7 m/s

BibTeX

@inproceedings{bhattacharya2025vision,
  title={Vision transformers for end-to-end vision-based quadrotor obstacle avoidance},
  author={Bhattacharya, Anish and Rao, Nishanth and Parikh, Dhruv and Kunapuli, Pratik and Wu, Yuwei and Tao, Yuezhan and Matni, Nikolai and Kumar, Vijay},
  booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
  year={2025},
  organization={IEEE}
}
