360DVO: Deep Visual Odometry for Monocular 360-Degree Camera
Abstract
In this paper, we present 360DVO, the first deep learning-based omnidirectional visual odometry (OVO) framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. The resulting sparse feature patches are used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and on public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%.
Pipeline
Our method takes sequential 360-degree RGB frames as input and extracts matching features and context features by applying our proposed DAS-Feat module to each frame. In DAS-Feat, the key component, SphereResNet, extracts distortion-resistant features, allowing patches to be cropped without deformation. After patchifying the matching features around their gradient maxima, we compute the correlation between patch features and context features and estimate optical flow with a recurrent network. In the ODBA module, the pose and depth of the current frame are jointly optimized by minimizing the distance between the flow-predicted patch location and the reprojected patch on the adjacent frame.
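The reprojection constraint at the heart of ODBA can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the equirectangular convention (longitude along x, latitude along y), the pinhole-free spherical back-projection, and all function names are assumptions for illustration only. It shows, for a single patch center, how depth and a relative pose produce a reprojected pixel whose distance to the flow-predicted pixel forms the residual being minimized.

```python
import numpy as np

def pix_to_ray(px, py, W, H):
    """Equirectangular pixel -> unit bearing vector (assumed convention)."""
    lon = (px / W) * 2.0 * np.pi - np.pi          # [-pi, pi)
    lat = np.pi / 2.0 - (py / H) * np.pi          # (+pi/2 top, -pi/2 bottom)
    return np.array([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)])

def ray_to_pix(r, W, H):
    """Unit bearing vector -> equirectangular pixel (inverse of above)."""
    r = r / np.linalg.norm(r)
    lon = np.arctan2(r[0], r[2])
    lat = np.arcsin(r[1])
    px = (lon + np.pi) / (2.0 * np.pi) * W
    py = (np.pi / 2.0 - lat) / np.pi * H
    return np.array([px, py])

def reprojection_residual(px, py, depth, R, t, flow_target, W, H):
    """Residual between the flow-predicted pixel and the pixel obtained by
    back-projecting the patch center with its depth, transforming it by the
    relative pose (R, t), and reprojecting into the adjacent frame."""
    X = depth * pix_to_ray(px, py, W, H)   # 3D point in current frame
    X_adj = R @ X + t                      # point in adjacent frame
    reproj = ray_to_pix(X_adj, W, H)       # reprojected pixel
    return flow_target - reproj
```

With the identity pose and an unchanged patch position, the residual vanishes; ODBA adjusts pose and depth so that this residual shrinks for all patches simultaneously.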
360DVO Dataset
We introduce the 360DVO Dataset, a large-scale real-world OVO benchmark emphasizing practical challenges across diverse environments and motions. It includes 20 sequences (~1k frames each), with all images standardized to 3840×1920 at 10 FPS. Pseudo ground truth is reconstructed with the SfM software Agisoft Metashape. To facilitate rigorous evaluation, we partition the dataset into Easy and Hard subsets (10 sequences each) by trajectory complexity and scene dynamics, ranging from linear, static cases to aggressive rotations, lighting shifts, and dynamic occlusions.
Easy-00: Bridge Night.
Easy-01: Canyon Line.
Easy-02: City Driving.
Easy-03: Downhill Biking.
Easy-04: Hongkong Central.
Easy-05: Hongkong Wanchai.
Easy-06: Mountains.
Easy-07: Shanghai Driving.
Easy-08: Snowmobile.
Easy-09: Tokyo Citywalk.
Hard-00: Canyon Loop.
Hard-01: Dragon Boat.
Hard-02: Drone Racetrack.
Hard-03: Field.
Hard-04: Grove.
Hard-05: Indoor RC Car.
Hard-06: London Bridge.
Hard-07: Ridge To Lake.
Hard-08: Snowy Mountain Road.
Hard-09: Wingsuit.
Experiments
Table I: Quantitative comparison of trajectory accuracy (ATE/RPE(t)/RPE(r)) and tracking success rate (%) on the 360DVO dataset (Easy and Hard). “-” indicates failure. The best result per sequence is shown in red, the second-best in blue. 360DVO (fast) denotes a variant that takes lower-resolution images as input while sampling sparser patches. All results are reported with two-decimal precision for clarity.
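As a reference for how the ATE numbers in Table I are typically computed, here is a hedged NumPy sketch of the standard procedure: align the estimated trajectory to ground truth with a similarity (Umeyama) transform, then report the RMSE of the residual positions. The paper's exact evaluation protocol (alignment type, frame association) may differ; function names here are illustrative.

```python
import numpy as np

def align_umeyama(est, gt):
    """Closed-form similarity alignment (scale s, rotation R, translation t)
    of estimated positions est (N,3) onto ground-truth positions gt (N,3)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    # Cross-covariance between ground truth and estimate
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:       # guard against reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(axis=0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after similarity alignment."""
    s, R, t = align_umeyama(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```

Because the alignment removes a global similarity transform, a trajectory that differs from ground truth only by scale, rotation, and translation yields an ATE of (numerically) zero.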
Figure I: Boxplot results of OVO methods on the 360DVO dataset. Our 360DVO runs stably with the lowest variance.
Figure II: Trajectory comparison on the 360DVO dataset in 3D space, with position along the X, Y, and Z axes plotted over all frames. Ground truth is shown as black dashed lines, 360DVO results as red solid lines, and OpenVSLAM results as blue solid lines.
Video Presentation
Demo Videos
Easy-00: Bridge Night.
Hard-02: Drone Racetrack.
The performance on long sequences.
BibTeX
@ARTICLE{11358682,
author={Guo, Xiaopeng and Xu, Yinzhe and Huang, Huajian and Yeung, Sai-Kit},
journal={IEEE Robotics and Automation Letters},
title={360DVO: Deep Visual Odometry for Monocular 360-Degree Camera},
year={2026},
volume={11},
number={3},
pages={3079-3086},
keywords={Feature extraction;Cameras;Nonlinear distortion;Convolution;Kernel;Bundle adjustment;Visual odometry;Accuracy;Benchmark testing;Robustness;Visual odometry;omnidirectional vision},
doi={10.1109/LRA.2026.3655280}
}