ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

✨CVPR 2025     Highlight

Zhengxian Yang1,*     Shi Pan1,*     Shengqi Wang1,*     Haoxiang Wang1     Li Lin2     Guanjun Li3     Zhengqi Wen1,†     Borong Lin1,†     Jianhua Tao1,†     Tao Yu1,†
1 Tsinghua University        2 Migu Beijing Research Institute       3 Institute of Automation, Chinese Academy of Sciences

 

* Equal contributions       † Corresponding authors


We introduce ImViD, a dataset for immersive volumetric videos. ImViD records dynamic scenes with a multi-view audio-video capture rig that moves through the scene in a space-oriented manner, providing a new benchmark for volumetric video reconstruction and its applications.

Abstract

User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios.

Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture.


Video: Real-Time 6-DoF Multi-Modal VR Experience

 

 

Pipeline & Method


We propose a pipeline to realize multi-modal 6-DoF immersive VR experiences. A carefully designed rig (a) simultaneously captures multi-view video and audio. (b1) presents our dynamic light-field reconstruction based on STG [2], while (b2) illustrates the sound-field construction process. By incorporating affine color transformation and t-dimensional density control, we achieve better results than the original algorithm on long-duration dynamic scenes. Ultimately, we deliver a 6-DoF immersive experience in both the light and sound fields, and also benchmark recent representative methods such as 4DGS [3] and 4D-Rotor GS [4] to demonstrate the effectiveness of both our dataset and our baseline method.
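To make the affine color transformation concrete, here is a minimal sketch (our own illustration, not the released code) of fitting and applying a per-camera 3x4 affine color matrix by least squares; the function names and parameterization are assumptions.

```python
# Minimal sketch: per-camera affine color transform A = [M | b], fitted by least
# squares, of the kind used to compensate exposure/white-balance differences
# between views.
import numpy as np

def fit_affine_color_transform(rendered: np.ndarray, captured: np.ndarray) -> np.ndarray:
    """Fit a 3x4 affine matrix A = [M | b] mapping rendered RGB to captured RGB."""
    X = np.concatenate([rendered.reshape(-1, 3),
                        np.ones((rendered.size // 3, 1))], axis=1)   # (N, 4)
    Y = captured.reshape(-1, 3)                                      # (N, 3)
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)                        # (4, 3)
    return A.T                                                       # (3, 4)

def apply_affine_color_transform(rendered_rgb: np.ndarray, A: np.ndarray) -> np.ndarray:
    """rendered_rgb: (H, W, 3) in [0, 1]; A: (3, 4) affine matrix."""
    M, b = A[:, :3], A[:, 3]
    return np.clip(rendered_rgb @ M.T + b, 0.0, 1.0)
```

One common design choice with such a correction is to fit one matrix per training camera so the optimizer can reconcile views whose colors disagree, while novel-view rendering omits or averages the per-camera correction.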

A handheld monocular camera is easy to move but provides only sparse perspectives from various locations. In contrast, a fixed camera array, while stationary, offers dense perspectives within a limited range. We aim to combine the advantages of both and design an effective data-collection system and strategy for a fully immersive VR experience -- a capability not found in existing datasets.

Note: Here we present the STG++ reconstruction results of four scenes in the ImViD dataset: Scene 1 (Opera), Scene 2 (Laboratory), Scene 5 (Rendition), and Scene 6 (Puppy). Due to webpage size limitations, the videos are somewhat compressed.

Benchmark & Results

Comparison of average metrics for three baselines and STG++ across four scenes in our dataset. Thanks to its smaller model size, 4DGS [3] can be trained on all 300 frames at once, so no segmented results are reported for it.

Visual Comparisons with Other 4DGS-based Methods

Comparisons on Color Temporal Consistency

We apply color mapping to address color differences caused by factors such as occlusions when capturing from different angles. We show the continuity of pixels at the same location across different frames and segments; the visualizations present the RGB (middle) and brightness (right) variations. Our STG++ achieves better temporal continuity, with almost no abrupt transitions either within or between segments.
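For reference, a minimal sketch of how such a pixel-trace visualization can be produced (our own illustration; the frame-loading convention and Rec.601 luma weights are assumptions):

```python
# Minimal sketch: trace the RGB and brightness of one fixed pixel across frames
# to inspect temporal continuity (abrupt jumps suggest inter-segment color drift).
import numpy as np
import matplotlib.pyplot as plt

def pixel_trace(frames, y, x):
    """frames: list of (H, W, 3) uint8 images; returns (T, 3) RGB and (T,) brightness."""
    rgb = np.stack([f[y, x].astype(np.float32) / 255.0 for f in frames])
    brightness = rgb @ np.array([0.299, 0.587, 0.114])   # Rec.601 luma
    return rgb, brightness

def plot_trace(rgb, brightness):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    for c, name in enumerate("RGB"):
        ax1.plot(rgb[:, c], label=name)
    ax1.set_xlabel("frame"); ax1.set_ylabel("value"); ax1.legend()
    ax2.plot(brightness); ax2.set_xlabel("frame"); ax2.set_ylabel("brightness")
    fig.tight_layout(); plt.show()
```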

Effectiveness of Sound Field Reconstruction

We conducted a user study with 21 experts to assess the performance of the sound field reconstruction. Each participant rated the sound field reconstruction on three dimensions: auditory spatial perception, sound quality, and immersiveness. Auditory Spatial Perception refers to the listener's ability to perceive the distribution and localization of sound in space. Sound Quality refers to the audio quality of the spatial audio compared to the corresponding microphone-recorded signal. Immersiveness refers to the listener's sense of being in a space where the sound source genuinely exists.
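As a rough illustration of how such ratings can be aggregated (the 1-5 scale, data layout, and function below are our assumptions, not details from the study), a per-dimension mean opinion score with a 95% confidence interval can be computed as:

```python
# Minimal sketch: aggregate per-participant ratings into a mean opinion score (MOS)
# with a normal-approximation 95% confidence interval for each dimension.
import numpy as np

def mean_opinion_score(ratings):
    """ratings: per-participant scores for one dimension (e.g. on a 1-5 scale)."""
    r = np.asarray(ratings, dtype=np.float64)
    mos = r.mean()
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci95

# Hypothetical example scores, for illustration only.
dimensions = {"spatial perception": [4, 5, 4], "sound quality": [4, 4, 5], "immersiveness": [5, 5, 4]}
for name, scores in dimensions.items():
    print(name, mean_opinion_score(scores))
```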

BibTeX

@misc{yang2025imvidimmersivevolumetricvideos,
      title={ImViD: Immersive Volumetric Videos for Enhanced VR Engagement}, 
      author={Zhengxian Yang and Shi Pan and Shengqi Wang and Haoxiang Wang and Li Lin and Guanjun Li and Zhengqi Wen and Borong Lin and Jianhua Tao and Tao Yu},
      year={2025},
      eprint={2503.14359},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.14359}, 
}

 

References

[1] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew DuVall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive Light Field Video with a Layered Mesh Representation. ACM Transactions on Graphics, 2020.
[2] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[3] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[4] Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes. In ACM SIGGRAPH 2024 Conference Papers, 2024.