SyNet: A Synergistic Network for 3D Object Detection through Geometric-Semantic-based Multi-Interaction Fusion

Abstract

Driven by rising demands from autonomous driving, robotics, and other applications, 3D object detection has recently achieved great advances by fusing optical images and LiDAR point data. However, most existing optical-LiDAR fusion methods directly overlay RGB images and point clouds without adequately exploiting the synergy between them, leading to suboptimal fusion and 3D detection performance. Additionally, they often suffer from limited localization accuracy because global and local object information is not properly balanced. To address these issues, we design a synergistic network (SyNet) that fuses geometric information, semantic information, as well as global and local object information for robust and accurate 3D detection. SyNet captures synergies between optical images and LiDAR point clouds from three perspectives. The first is geometric: it derives high-quality depth by projecting point clouds onto multi-view images, enriching optical RGB images with 3D spatial information for a more accurate interpretation of image semantics. The second is semantic: it voxelizes point clouds and establishes correspondences between the derived voxels and image pixels, enriching 3D point clouds with semantic information for more accurate 3D detection. The third balances local and global object information: it introduces deformable self-attention and cross-attention to process the two types of complementary information in parallel for more accurate object localization. Extensive experiments show that SyNet achieves 70.7% mAP and 73.5% NDS on the nuScenes test set, demonstrating its effectiveness and superiority compared with the state-of-the-art.
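As a rough illustration of the geometric interaction described above, the following minimal sketch projects LiDAR points into a camera view to form a sparse depth map. It is not the paper's implementation; the function and variable names are our own, and a known LiDAR-to-camera transform and camera intrinsics are assumed.

```python
import numpy as np

def project_points_to_image(points_lidar, T_cam_from_lidar, K, image_hw):
    """Project LiDAR points into a camera image to build a sparse depth map.

    points_lidar:      (N, 3) LiDAR points (x, y, z), assumed available.
    T_cam_from_lidar:  (4, 4) rigid transform from LiDAR to camera frame (assumed known).
    K:                 (3, 3) camera intrinsic matrix (assumed known).
    image_hw:          (H, W) image size in pixels.
    """
    H, W = image_hw
    # Homogeneous coordinates, then transform into the camera frame.
    pts_h = np.concatenate([points_lidar, np.ones((points_lidar.shape[0], 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-3]
    # Perspective projection with the intrinsics.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, z = uv[:, 0], uv[:, 1], pts_cam[:, 2]
    # Discard points that fall outside the image.
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z = u[valid].astype(int), v[valid].astype(int), z[valid]
    # Scatter depths into a sparse map, keeping the nearest point per pixel.
    depth = np.full((H, W), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

The resulting sparse depth map can then be concatenated with (or used to lift) the RGB features, which is the general idea behind enriching image semantics with 3D spatial information.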

Publication
In IEEE Transactions on Multimedia (TMM), 2025