Tracking Emergence in Video Generation Models

Cristobal Eyzaguirre*, Yunong Liu*, Stefan Stojanov, Adrien Gaidon, Juan Carlos Niebles, Jiajun Wu
(*Alphabetically)
Ongoing ICCV Submission

Abstract

Recent generative video models produce videos that look realistic and temporally coherent, hinting at the emergence of implicit internal representations—or world models—that understand the dynamics of the physical world. However, the extent to which these models capture and utilize such world models remains an open question. In this paper, we propose a method to track the emergence of dynamics understanding in video generation models by leveraging point tracking as a diagnostic tool.

Utilizing both inversion techniques and interpretable visual prompts, which we term interventions, we assess the model's ability to perform zero-shot generalization: tracking the marker through complex motions, occlusions, and transformations to evaluate its internal understanding of physical dynamics.

Our experiments reveal that state-of-the-art video diffusion models can effectively propagate the introduced markers without prior exposure to such tasks, providing empirical evidence of robust world modeling capabilities. This dual exploration not only demonstrates a novel application of video generation models for point tracking but also offers a lens to track the internal emergence of dynamic understanding within these models. We argue that point tracking performance should serve as a defining metric for evaluating and guiding the development of generative video models. Our findings shed light on the hidden mechanisms of these models and pave the way for future research into their world modeling properties.

Method

Video Generation Paradigms
Figure 1: Comparison of video generation paradigms. Top: The existing approach, where video diffusion models generate sequences based on noise and minimal conditioning inputs. Bottom: Our proposed method introduces a visual prompt to an input video, leveraging the model's internal dynamics.

Our approach introduces an intervention-based framework for probing and evaluating the internal world models of generative video models through point tracking. The method uses visual prompts as interventions, together with inversion-based noise scheduling and the diffusion process itself, to propagate these interventions across frames, allowing us to track dynamics within the generated videos.
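As a concrete illustration, the sketch below shows what the intervention step might look like in practice: a small visual marker is painted into the first frame of a video tensor before the clip is handed to the diffusion model. The function name add_marker and its parameters are our own placeholders for exposition, not the actual implementation.

    # Minimal sketch of the intervention step: paint a circular marker into the
    # first frame of a video tensor. Names and defaults are illustrative only.
    import torch

    def add_marker(video: torch.Tensor, xy: tuple, radius: int = 4,
                   color: tuple = (1.0, 0.0, 0.0)) -> torch.Tensor:
        """video: (T, C, H, W) in [0, 1]; returns a copy with a marker at `xy` in frame 0."""
        T, C, H, W = video.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        mask = (xs - xy[0]) ** 2 + (ys - xy[1]) ** 2 <= radius ** 2  # (H, W) disk
        out = video.clone()
        for c in range(C):
            out[0, c][mask] = color[c]                               # recolor frame 0 only
        return out

    video = torch.rand(16, 3, 256, 256)           # placeholder clip
    intervened = add_marker(video, xy=(128, 96))  # intervention in the initial frame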

Intervention Propagation Model
Figure 2: Overview of the Intervention Propagation Model. Left: Intervention is introduced in the initial frame by adding a visual marker. Noise scheduling and Classifier-Free Guidance in the diffusion model propagate this intervention through subsequent frames, resulting in a sequence that reflects dynamic responses to the initial intervention. Right: Comparison of frames with and without intervention allows for tracking dynamics and detecting motion patterns, helping to reveal the model's internal understanding of physical dynamics.
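The two ingredients named in the caption can be sketched compactly. Below is a hedged illustration, assuming a generic noise-prediction callable denoiser (a stand-in, not the actual model interface): one classifier-free guidance step that blends conditional and unconditional predictions, and a simple read-out of the propagated marker obtained by differencing frames generated with and without the intervention.

    # Hedged sketch of one classifier-free guidance (CFG) step and of localizing the
    # propagated marker by comparing intervened vs. non-intervened outputs.
    # `denoiser` is a stand-in noise predictor; its signature is an assumption.
    import torch

    def cfg_noise(denoiser, latents, t, cond, guidance_scale: float = 7.5):
        eps_uncond = denoiser(latents, t, cond=None)   # unconditional branch
        eps_cond = denoiser(latents, t, cond=cond)     # conditional branch
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    def localize_marker(frames_with: torch.Tensor, frames_without: torch.Tensor):
        """frames_*: (T, C, H, W). Returns per-frame (x, y) of the largest difference."""
        diff = (frames_with - frames_without).abs().sum(dim=1)   # (T, H, W)
        T, H, W = diff.shape
        flat_idx = diff.flatten(1).argmax(dim=1)                 # (T,)
        return torch.stack([flat_idx % W, flat_idx // W], dim=1)  # (T, 2) as (x, y)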
Adaptive Searching Strategies
Figure 3: Illustration of Adaptive Searching Strategies. This figure demonstrates the adaptive search space expansion, chunk size adaptation based on tracking confidence, and the use of soft and hard constraints for stable tracking.
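The sketch below conveys the flavor of such an adaptive schedule under assumptions of our own: when tracking confidence drops below a low threshold, the search radius is expanded and the chunk of frames processed at once is reduced; when confidence recovers, both are tightened again. Thresholds, step sizes, and the mapping onto soft and hard constraints are illustrative, not the settings used in our experiments.

    # Illustrative sketch (not the actual code) of adaptive search expansion and
    # chunk size adaptation driven by tracking confidence.
    def adapt_search(confidence: float, radius: int, chunk: int,
                     low: float = 0.5, high: float = 0.8,
                     min_chunk: int = 2, max_chunk: int = 16,
                     min_radius: int = 8, max_radius: int = 64):
        """Return (new_radius, new_chunk) given the latest tracking confidence."""
        if confidence < low:          # low confidence: expand search, process fewer frames
            radius = min(max_radius, radius * 2)
            chunk = max(min_chunk, chunk // 2)
        elif confidence > high:       # high confidence: tighten search, grow the chunk
            radius = max(min_radius, radius // 2)
            chunk = min(max_chunk, chunk + 2)
        return radius, chunk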

Results

Successful Cases

[Paired video panels: Ground Truth vs. Prediction]
Figure 4: Successful tracking cases across various scenarios. The model demonstrates robust tracking performance in scenes with regular motion patterns and clear visibility.

Failure Cases - Dynamic Scenes

[Paired video panels: Ground Truth vs. Prediction]
Figure 5: Failure cases in complex dynamic scenes. The model struggles with scenes containing multiple moving objects and complex interactions.

Failure Cases - Fast Motion

[Paired video panels: Ground Truth vs. Prediction]
Figure 6: Failure cases in rapid motion scenarios. The model has difficulty maintaining accurate tracking when objects move very quickly or undergo sudden direction changes.