The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach adopts a two-stage architecture: it first predicts camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, and finally converts these sequences into a persistent 4D representation. This framework is the first to leverage both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models, which significantly facilitates 4D generation and yields higher quality (e.g., in mPSNR and mSSIM) than existing methods.
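The following is a minimal structural sketch of the two-stage pipeline described above; all names (`predict_trajectory`, `generate_multiview`, `lift_to_4d`) are illustrative placeholders rather than the released Dream4D API, and the bodies are stand-ins for the learned components.

```python
# Hedged sketch of the Dream4D pipeline structure (placeholder implementations).
import numpy as np

def predict_trajectory(image: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Stage 1 (placeholder): predict one camera pose per frame from a single image.
    Here we simply return identity extrinsics; the real stage uses few-shot
    trajectory prediction."""
    return np.tile(np.eye(4), (num_frames, 1, 1))

def generate_multiview(image: np.ndarray, poses: np.ndarray) -> np.ndarray:
    """Stage 2 (placeholder): pose-conditioned video diffusion.
    Returns one frame per pose; here we just repeat the input image."""
    return np.stack([image for _ in poses])

def lift_to_4d(frames: np.ndarray, poses: np.ndarray) -> dict:
    """Final step (placeholder): fuse frames and poses into a persistent
    4D representation (e.g., dynamic point clouds or Gaussians)."""
    return {"frames": frames, "poses": poses}

if __name__ == "__main__":
    image = np.zeros((256, 256, 3), dtype=np.float32)  # single input image
    poses = predict_trajectory(image)                   # stage 1: camera trajectory
    frames = generate_multiview(image, poses)           # stage 2: multi-view sequence
    scene_4d = lift_to_4d(frames, poses)                # persistent 4D representation
    print(frames.shape, poses.shape)
```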
| Method | mPSNR (dB) ↑ | mSSIM ↑ | mLPIPS ↓ |
|---|---|---|---|
| Dream4D | 20.56 | 0.702 | 0.170 |
| Megasam | 17.625 | 0.601 | 0.207 |
| Shape-of-Motion | 16.72 | 0.630 | 0.450 |
| Cut3r | 14.69 | 0.543 | 0.341 |
| CamI2V | 14.08 | 0.449 | 0.334 |
| SeVA | 12.67 | 0.495 | 0.579 |
The superior performance of our method is not merely reflected in quantitative gains or visual quality; it fundamentally stems from the ability to jointly model geometry and dynamics while maintaining multi-view and temporal consistency. The analysis highlights how architectural choices, such as the integration of pose-aware warping and dynamic feature refinement, directly address common failure modes of existing methods: structural fragmentation, motion blur, and flickering. By examining both numerical trends and visual artifacts across baselines, we find that competing approaches often sacrifice one aspect of reconstruction quality (e.g., geometry) for another (e.g., smoothness), whereas our design enables synergistic improvements.
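To make the warping step concrete, below is a minimal sketch of pose-aware warping in the classical backproject-transform-project form, assuming known per-pixel depth, shared intrinsics `K`, and a relative pose `T_src_to_tgt`; the actual Dream4D module likely operates on learned feature maps and handles occlusion, which this simplified forward splat does not.

```python
# Simplified pose-aware warping: lift source pixels to 3D with depth, move them
# into the target camera frame, and reproject (nearest-neighbor forward splat).
import numpy as np

def warp_to_target(src: np.ndarray, depth: np.ndarray, K: np.ndarray,
                   T_src_to_tgt: np.ndarray) -> np.ndarray:
    """Warp a source image into a target view given depth, intrinsics, and pose."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)   # backproject to 3D rays * depth
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])      # homogeneous coordinates
    tgt = (T_src_to_tgt @ cam_h)[:3]                          # transform into target frame
    uv = (K @ tgt) / np.clip(tgt[2:3], 1e-6, None)            # project (no behind-camera check)
    u = np.clip(np.round(uv[0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[1]).astype(int), 0, h - 1)
    out = np.zeros_like(src)
    out[v, u] = src[ys.ravel(), xs.ravel()]                   # forward splat, ignoring occlusion
    return out
```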
| Variant | mPSNR (dB) ↑ | mSSIM ↑ | mLPIPS ↓ |
|---|---|---|---|
| Dynamic Module + 4D Generator | 19.78 | 0.6686 | 0.1220 |
| Only Dynamic Module | 18.37 | 0.6361 | 0.1841 |
| δ | -1.41 | -0.0325 | +0.0621 |
| Static Module + 4D Generator | 13.35 | 0.6550 | 0.2996 |
| Only Static Module | 12.56 | 0.5779 | 0.3461 |
| δ | -0.79 | -0.0771 | +0.0465 |
This table presents a quantitative ablation study of the 4D Generator module across different system configurations. The data show consistent improvements when the 4D Generator is integrated: for dynamic scenes, it boosts mPSNR by 1.41 dB (18.37→19.78) and mSSIM by 0.0325 (0.6361→0.6686) while reducing mLPIPS by 0.0621; similar gains occur in static scenes, with a 0.79 dB mPSNR and 0.0771 mSSIM improvement and a 0.0465 mLPIPS reduction. The δ rows demonstrate the 4D Generator's dual capability to enhance both spatial quality (mPSNR/mSSIM) and temporal consistency (mLPIPS) across operating modes.
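As a sanity check, the δ rows can be reproduced directly from the table entries; the snippet below simply subtracts the full-configuration metrics from the module-only metrics (variable names are illustrative, values copied from the table).

```python
# Recomputing the δ rows of the ablation table from the reported numbers.
full_dyn    = {"mPSNR": 19.78, "mSSIM": 0.6686, "mLPIPS": 0.1220}
only_dyn    = {"mPSNR": 18.37, "mSSIM": 0.6361, "mLPIPS": 0.1841}
full_static = {"mPSNR": 13.35, "mSSIM": 0.6550, "mLPIPS": 0.2996}
only_static = {"mPSNR": 12.56, "mSSIM": 0.5779, "mLPIPS": 0.3461}

def delta(without: dict, with_gen: dict) -> dict:
    """Metric change when the 4D Generator is removed
    (negative = worse for mPSNR/mSSIM, positive = worse for mLPIPS)."""
    return {k: round(without[k] - with_gen[k], 4) for k in with_gen}

print(delta(only_dyn, full_dyn))        # {'mPSNR': -1.41, 'mSSIM': -0.0325, 'mLPIPS': 0.0621}
print(delta(only_static, full_static))  # {'mPSNR': -0.79, 'mSSIM': -0.0771, 'mLPIPS': 0.0465}
```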