In recent years, visual Simultaneous Localization and Mapping (SLAM) has gained significant attention and found wide-ranging applications in diverse scenarios. Recent advances in computer vision and deep learning have further enriched visual SLAM with capabilities for scene understanding and large-scale operation. However, despite remarkable performance in these fields, most visual SLAM frameworks are designed under the static-world assumption. They therefore often struggle in dynamic environments, exhibiting reduced localization accuracy, tracking failures, and limited generalization.
To investigate the impact of moving objects in dynamic indoor environments, we first benchmark representative visual and dynamic SLAM approaches, complemented by robustness assessments for preliminary insights. For this process, we adopt challenging sequences from GRADE, a platform well suited to simulating dynamic indoor scenes. Notably, mainstream dynamic SLAM methods rely on detection or segmentation techniques to handle moving objects. To explore the correlation between detector accuracy and overall SLAM performance, we integrate a series of trained YOLOv5 and Mask R-CNN models with varying accuracy levels into dynamic SLAM systems and evaluate these configurations on the TUM RGB-D sequences. Contrary to common intuition, the experiments indicate that more accurate object detectors do not necessarily lead to improved visual SLAM performance. This benchmarking process also reveals several inherent limitations of current dynamic SLAM techniques, underscoring the need for further advancements.
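To illustrate the coupling between detector output and SLAM performance studied in the benchmark, the following minimal sketch shows how detection- or segmentation-based dynamic SLAM systems typically use a predicted dynamic-object mask to discard features before tracking. This is an illustrative assumption about the general pipeline, not the benchmarked systems' actual code; all names are hypothetical.

```python
# Sketch: filtering keypoints with a dynamic-object mask, as detection-based
# dynamic SLAM methods commonly do (illustrative only, not DynaSLAM's code).
import numpy as np

def filter_dynamic_keypoints(keypoints, dynamic_mask):
    """Keep only keypoints falling outside the predicted dynamic-object mask.

    keypoints    : (N, 2) array of (u, v) pixel coordinates
    dynamic_mask : (H, W) boolean array, True where a detector/segmenter
                   (e.g., Mask R-CNN masks or YOLOv5 boxes) flags a
                   potentially moving object
    """
    u = keypoints[:, 0].astype(int)
    v = keypoints[:, 1].astype(int)
    keep = ~dynamic_mask[v, u]          # drop features on flagged objects
    return keypoints[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w = 480, 640
    kps = np.column_stack([rng.integers(0, w, 200), rng.integers(0, h, 200)])
    mask = np.zeros((h, w), dtype=bool)
    mask[100:300, 200:400] = True       # pretend a person was detected here
    static_kps = filter_dynamic_keypoints(kps, mask)
    print(f"kept {len(static_kps)} of {len(kps)} keypoints")
```

Because the mask is applied as a hard filter, detector errors propagate directly into the feature set, which is one reason higher detector accuracy does not automatically translate into better trajectory estimates.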
Building upon these insights, we introduce DynaPix SLAM, a visual SLAM system for dynamic indoor environments in which the contribution of visual cues (e.g., features) is weighted by per-pixel motion probability values. Our approach consists of a semantic-free, pixel-wise motion estimation module and an improved pose optimization process. In the first stage, the motion probability estimator applies a novel static background differencing method to both images and optical flow to identify moving regions. These probabilities are then incorporated into map point selection and a weighted bundle adjustment for backend optimization. We evaluate DynaPix SLAM and its variant, DynaPix-D, against ORB-SLAM2 and DynaSLAM on both the TUM RGB-D and GRADE sequences, with additional tests on the static versions of the GRADE sequences. The results show that DynaPix SLAM consistently outperforms the other methods, achieving lower localization errors and longer tracking durations across various scenarios.
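As a rough illustration of how per-pixel motion probabilities can enter the backend, the sketch below scales reprojection residuals by a weight derived from the motion probability sampled at each observed feature. This is a minimal sketch under the assumption of a (1 - p) weighting on the squared error; it is not the DynaPix implementation, and all function names are illustrative.

```python
# Sketch: motion-probability-weighted reprojection residuals for a
# least-squares backend (illustrative assumption, not DynaPix's actual code).
import numpy as np

def project(K, T_cw, X_w):
    """Project 3D world points X_w (N, 3) into the image using pose T_cw (4x4)."""
    X_h = np.hstack([X_w, np.ones((len(X_w), 1))])
    X_c = (T_cw @ X_h.T).T[:, :3]
    uv = (K @ X_c.T).T
    return uv[:, :2] / uv[:, 2:3]

def weighted_reprojection_residuals(K, T_cw, points_3d, observations, motion_prob):
    """Residuals scaled by sqrt(1 - p), so the squared cost becomes (1 - p)*||e||^2.

    motion_prob : (N,) per-pixel motion probability in [0, 1], sampled at each
                  observed feature location (assumed to come from the estimator).
    Features on likely-moving pixels are down-weighted rather than hard-rejected.
    """
    e = project(K, T_cw, points_3d) - observations          # (N, 2) pixel errors
    w = np.sqrt(np.clip(1.0 - motion_prob, 0.0, 1.0))       # static pixels weigh more
    return (w[:, None] * e).ravel()                          # stack for a solver
```

The returned residual vector can be handed to a standard nonlinear least-squares solver, so that observations with high motion probability contribute little to the pose estimate instead of being removed outright.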