Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation

Main visualization

4D Reconstruction from input conditions. We present a comparative visualization of 3D scene generation from static versus dynamic inputs, organized into three examples. The left "Static Scene" section demonstrates bedroom reconstruction from a text/image input ("A minimalist bedroom with light colors and wooden furniture"), showing an initial 3D point-cloud representation followed by four coherent multi-view renders. The central and right "Dynamic Scene" sections illustrate temporal modeling: the middle sequence captures a turtle's movement on a rock (input: "A turtle sunbathes on a mossy rock by a river"), while the rightmost sequence shows a school bus traversing left to right (input: "A yellow school bus drives left-to-right past signs"); both display consistent object motion across four temporal frames. Green annotations in the dynamic cases highlight spatiotemporal feature tracking, contrasting with the static scene's view-consistent geometry. The visualization thus contrasts single-view reconstruction (static) with view-consistent motion synthesis (dynamic) within a unified framework.

Abstract

The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through the synergy of controllable video generation and neural 4D reconstruction. Our approach is a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, and finally converts them into a persistent 4D representation. This framework is the first to leverage both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models, which significantly facilitates 4D generation and yields higher quality (e.g., mPSNR, mSSIM) than existing methods.
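
As a rough sketch of this two-stage design, the snippet below chains trajectory prediction, pose-conditioned generation, and 4D reconstruction behind placeholder stubs; every name here (predict_trajectory, generate_multiview_video, reconstruct_4d, CameraPose) is a hypothetical stand-in for illustration, not the released Dream4D API.

```python
# Minimal pipeline sketch (hypothetical names, not the released Dream4D code):
# (1) predict a camera trajectory from a single image, (2) sample a pose-conditioned
# video diffusion model along that trajectory, (3) lift the frames into a 4D scene.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class CameraPose:
    rotation: np.ndarray     # 3x3 world-to-camera rotation
    translation: np.ndarray  # 3-vector camera translation


def predict_trajectory(image: np.ndarray, prompt: str, num_frames: int = 16) -> List[CameraPose]:
    """Stage 1 placeholder: few-shot trajectory prediction (here a fixed pan-left)."""
    return [
        CameraPose(rotation=np.eye(3), translation=np.array([-t, 0.0, 0.0]))
        for t in np.linspace(0.0, 0.3, num_frames)
    ]


def generate_multiview_video(image: np.ndarray, prompt: str, poses: List[CameraPose]) -> List[np.ndarray]:
    """Stage 2 placeholder: pose-conditioned diffusion sampling (returns dummy frames)."""
    return [np.zeros_like(image) for _ in poses]


def reconstruct_4d(frames: List[np.ndarray], poses: List[CameraPose]) -> dict:
    """Stage 3 placeholder: fuse frames and poses into a persistent 4D representation."""
    return {"frames": frames, "poses": poses}


if __name__ == "__main__":
    rgb = np.zeros((320, 512, 3), dtype=np.float32)   # single input image
    cams = predict_trajectory(rgb, "pan left")
    views = generate_multiview_video(rgb, "pan left", cams)
    scene_4d = reconstruct_4d(views, cams)
    print(len(scene_4d["frames"]), "generated views lifted to 4D")
```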

More Visualization on Out-of-domain Images (512×320)

*Generated by the 512×320 model (50k training steps); compatible with input images of arbitrary aspect ratio.
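
One plausible way to feed images of arbitrary aspect ratio into a fixed 512×320 model is to resize and letterbox them before inference; the helper below is an illustrative preprocessing sketch under that assumption, not the preprocessing actually used for these results.

```python
# Assumed preprocessing sketch: scale an arbitrary-aspect-ratio image to fit 512x320,
# then pad (letterbox) the remainder. Not the released Dream4D preprocessing.
from PIL import Image


def letterbox_to_512x320(path: str) -> Image.Image:
    target_w, target_h = 512, 320
    img = Image.open(path).convert("RGB")
    scale = min(target_w / img.width, target_h / img.height)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = img.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", (target_w, target_h))   # black borders
    canvas.paste(resized, ((target_w - new_w) // 2, (target_h - new_h) // 2))
    return canvas
```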

Animated examples: Automobile, Sika deer, French Bulldog, Eagle, Crocodile, Fox, Seagull, Flamingo, Cheetah, Lion, Red racing car, Tiger.

Visualization on Static Demos

Multi-object static generation showcase, generated using the static module and the 4D generator.

Animated demos: Sleeping pup by vintage bike on cobblestone street; Cozy living room with orange sofa and tea set; Sunlit lounge with brown sofa and thriving greenery; A minimalist bedroom with light colors and wooden furniture.

Quantitative Comparison

Method           mPSNR (dB) ↑   mSSIM ↑   mLPIPS ↓
Dream4D          20.56          0.702     0.170
MegaSaM          17.625         0.601     0.207
Shape-of-Motion  16.72          0.630     0.450
CUT3R            14.69          0.543     0.341
CamI2V           14.08          0.449     0.334
SeVA             12.67          0.495     0.579
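
If, as is common in dynamic-scene benchmarks, the "m" prefix denotes masked metrics averaged over valid (e.g., co-visible) pixels, a masked PSNR can be sketched as below; masked_psnr is an illustrative helper, and the table's exact evaluation protocol is not reproduced here.

```python
# Hedged sketch of a masked PSNR in the spirit of the mPSNR column above:
# the error is averaged only over pixels flagged valid by a mask.
import numpy as np


def masked_psnr(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray, max_val: float = 1.0) -> float:
    """pred, gt: HxWx3 floats in [0, max_val]; mask: HxW boolean array of valid pixels."""
    diff = (pred - gt)[mask]                      # keep only valid pixels
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0.0 else 10.0 * float(np.log10(max_val ** 2 / mse))


# Toy check: small Gaussian noise inside a full mask gives roughly 40 dB.
gt = np.random.rand(320, 512, 3).astype(np.float32)
pred = np.clip(gt + np.random.normal(scale=0.01, size=gt.shape), 0.0, 1.0)
print(f"mPSNR ~ {masked_psnr(pred, gt, np.ones((320, 512), dtype=bool)):.2f} dB")
```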

The superior performance of our method is not merely reflected in quantitative gains or visual quality; it stems from the ability to jointly model geometry and dynamics while maintaining multi-view and temporal consistency. The analysis highlights how architectural choices, such as the integration of pose-aware warping and dynamic feature refinement, directly address common failure modes of existing methods, including structural fragmentation, motion blur, and flickering. Examining both numerical trends and visual artifacts across baselines shows that competing approaches often sacrifice one aspect of reconstruction quality (e.g., geometry) for another (e.g., smoothness), whereas our design enables synergistic improvements.
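
To make the pose-aware warping mentioned above concrete, the sketch below implements a generic depth-based inverse warp: target pixels are back-projected with a depth map, moved by the relative camera pose, and re-projected to sample source-view features. The function name, argument conventions, and the toy example are ours for illustration; this is not the exact Dream4D module.

```python
# Generic pose-aware (depth-based) inverse warp, for illustration only.
import torch
import torch.nn.functional as F


def pose_aware_warp(src_feat, tgt_depth, K, R_s_from_t, t_s_from_t):
    """Sample source-view features at the locations where each target pixel lands
    in the source image.
    src_feat: (B,C,H,W) source features; tgt_depth: (B,1,H,W) target-view depth;
    K: (B,3,3) intrinsics; R_s_from_t: (B,3,3), t_s_from_t: (B,3) source-from-target pose."""
    B, C, H, W = src_feat.shape
    device = src_feat.device

    # Homogeneous pixel grid of the target view.
    y, x = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project with the target depth, then move the points into the source frame.
    cam_pts = torch.linalg.inv(K) @ pix * tgt_depth.view(B, 1, -1)
    src_pts = R_s_from_t @ cam_pts + t_s_from_t.view(B, 3, 1)

    # Project into the source image and normalize to [-1, 1] for grid_sample.
    proj = K @ src_pts
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    grid = torch.stack(
        [2.0 * uv[:, 0] / (W - 1) - 1.0, 2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1
    ).view(B, H, W, 2)
    return F.grid_sample(src_feat, grid, align_corners=True)


# Toy example with dummy tensors (identity rotation, small lateral translation).
feat = torch.randn(1, 16, 320, 512)
depth = torch.full((1, 1, 320, 512), 2.0)
K = torch.tensor([[[500.0, 0.0, 256.0], [0.0, 500.0, 160.0], [0.0, 0.0, 1.0]]])
warped = pose_aware_warp(feat, depth, K, torch.eye(3)[None], torch.tensor([[0.1, 0.0, 0.0]]))
print(warped.shape)  # torch.Size([1, 16, 320, 512])
```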

Ablation Study

Variant                        mPSNR (dB) ↑   mSSIM ↑   mLPIPS ↓
Dynamic Module + 4D Generator  19.78          0.6686    0.1220
Only Dynamic Module            18.37          0.6361    0.1841
Δ (removing 4D Generator)      -1.41          -0.0325   +0.0621
Static Module + 4D Generator   13.35          0.6550    0.2996
Only Static Module             12.56          0.5779    0.3461
Δ (removing 4D Generator)      -0.79          -0.0771   +0.0465

This table presents a quantitative ablation of the 4D Generator across system configurations. Integrating the 4D Generator yields consistent improvements: in the dynamic setting it raises mPSNR by 1.41 dB (18.37→19.78) and mSSIM by 0.0325 (0.6361→0.6686) while reducing mLPIPS by 0.0621; similar gains occur in the static setting, with improvements of 0.79 dB in mPSNR and 0.0771 in mSSIM. The Δ rows show that the 4D Generator improves both reconstruction fidelity (mPSNR/mSSIM) and perceptual quality (mLPIPS) in both operational modes.

Qualitative Comparison of 4D Reconstruction Methods on Static and Dynamic Scenes

The left panel displays input frames, while the middle blue-framed section shows our method's results featuring complete geometry and sharp details in both scene types. In comparison, the yellow-framed right section reveals typical artifacts in baseline outputs: structural distortions in static scenes and temporal inconsistencies in dynamic sequences, with arrows highlighting key failure cases like blurred surfaces or broken geometries.

Ablation Study on 4D Generator's Impact for Dynamic and Static Scene Modeling

This 2×2 grid presents an ablation comparing the isolated and combined performance of the dynamic and static modules with the 4D Generator. The top row shows baseline outputs (left: the dynamic module alone, with temporal flickering artifacts highlighted by red boxes; right: the static module alone, exhibiting multi-view inconsistencies). The bottom row shows markedly improved results once the 4D Generator is integrated: the dynamic+4D case (left) exhibits smooth temporal transitions, with arrows indicating coherent motion, while the static+4D case (right) displays geometrically consistent multi-view renders. Red bounding boxes emphasize the key improvements in texture detail (static) and motion continuity (dynamic), visually validating the 4D Generator's dual enhancement of spatial and temporal reconstruction.

From Text to 4D: Pipeline Diagram of Dream4D's Hybrid VLM-Transformer-Diffusion Architecture


Pipeline Overview. Our method is a multimodal architecture for generating dynamic videos from static images and text prompts. The pipeline begins with two parallel input streams: (1) an Image Encoder processes the input image (e.g., a dog resting on a cobblestone street with bicycles and distant pedestrians), while (2) a Tokenizer and Embedding Layer extract semantic features from the text prompt (e.g., "pan left"). These modalities are aligned by a Vision-Language Projector, which maps them into a unified Shared Embedding Space augmented with Positional Encoding for spatiotemporal coherence. The fused representation is then processed by a Transformer Encoder, leveraging both Vision-Language Model (VLM) and Large Language Model (LLM) components to interpret cross-modal instructions (e.g., viewpoint changes). Subsequent refinement occurs in a Diffusion Transformer (DiT) Block, where patch-based operations and multi-head attention enable controlled video synthesis. Finally, the Decoder generates the frame sequence, transforming latent features into the output 4D video.
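
The PyTorch sketch below mirrors these stages (image encoder, tokenizer/embedding, vision-language projector, shared embedding with positional encoding, transformer encoder, DiT-style block, decoder) with toy dimensions. It is a schematic stand-in, not the actual Dream4D implementation: the class and all hyperparameters are our assumptions, and the diffusion/denoising process itself is omitted.

```python
# Schematic stand-in for the described pipeline (illustrative, not Dream4D code).
import torch
import torch.nn as nn


class PipelineSketch(nn.Module):
    def __init__(self, d_model=256, vocab=1000, patch=32, frames=8, h=320, w=512, text_len=32):
        super().__init__()
        self.frames, self.h, self.w, self.patch, self.text_len = frames, h, w, patch, text_len
        n_patches = (h // patch) * (w // patch)

        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)  # image encoder (patchify)
        self.text_embed = nn.Embedding(vocab, d_model)                               # tokenizer + embedding layer
        self.projector = nn.Linear(d_model, d_model)                                 # vision-language projector
        self.pos = nn.Parameter(torch.zeros(1, n_patches + text_len, d_model))       # positional encoding
        self.encoder = nn.TransformerEncoder(                                        # cross-modal transformer encoder
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.dit = nn.TransformerEncoder(                                            # DiT-style refinement stand-in
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.Linear(d_model, patch * patch * 3)                         # tokens -> RGB patches

    def forward(self, image, text_ids):
        B = image.shape[0]
        img_tok = self.image_encoder(image).flatten(2).transpose(1, 2)             # (B, N, d) visual tokens
        txt_tok = self.text_embed(text_ids)                                         # (B, T, d) text tokens
        tokens = self.projector(torch.cat([txt_tok, img_tok], dim=1)) + self.pos    # shared embedding space
        cond = self.encoder(tokens)                                                 # interpret cross-modal instructions
        vid = self.dit(cond[:, self.text_len:].repeat(1, self.frames, 1))           # tile image tokens per frame, refine
        patches = self.decoder(vid)                                                 # (B, F*N, patch*patch*3)
        hp, wp = self.h // self.patch, self.w // self.patch
        video = patches.view(B, self.frames, hp, wp, self.patch, self.patch, 3)
        return video.permute(0, 1, 6, 2, 4, 3, 5).reshape(B, self.frames, 3, self.h, self.w)


model = PipelineSketch()
video = model(torch.randn(1, 3, 320, 512), torch.randint(0, 1000, (1, 32)))
print(video.shape)  # torch.Size([1, 8, 3, 320, 512])
```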