Recent generative video models produce videos that are realistic and temporally coherent, hinting at the emergence of implicit internal representations, or world models, that capture the dynamics of the physical world. However, the extent to which these models learn and exploit such world models remains an open question. In this paper, we propose a method to track the emergence of dynamics understanding in video generation models by leveraging point tracking as a diagnostic tool.
Utilizing inversion techniques together with interpretable visual prompts, which we term interventions, we assess the models' ability to perform zero-shot generalization: tracking a marker through complex motions, occlusions, and transformations to evaluate their internal understanding of physical dynamics.
Our experiments reveal that state-of-the-art video diffusion models can effectively propagate the introduced markers without prior exposure to such tasks, providing empirical evidence of robust world modeling capabilities. This dual exploration not only demonstrates a novel application of video generation models for point tracking but also offers a lens to track the internal emergence of dynamic understanding within these models. We argue that point tracking performance should serve as a defining metric for evaluating and guiding the development of generative video models. Our findings shed light on the hidden mechanisms of these models and pave the way for future research into their world modeling properties.
Our approach introduces an intervention-based framework for probing and evaluating the internal world models of generative video models through point tracking. The method uses visual prompts as interventions, together with inversion-based noise scheduling and the diffusion process itself, to propagate these interventions across frames, allowing us to track dynamics within the generated videos.
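As a rough illustration, the pipeline can be sketched as follows. This is a minimal sketch, not the actual implementation: `paint_marker`, `invert_to_noise`, and the identity `denoise` stub are hypothetical stand-ins for the real components (a full implementation would use DDIM inversion and a video diffusion sampler in place of the Gaussian blend and the stub).

```python
import numpy as np

def paint_marker(frame, xy, radius=2, value=1.0):
    """Overlay a square marker (the intervention) at pixel location xy."""
    x, y = xy
    out = frame.copy()
    out[max(0, y - radius):y + radius + 1, max(0, x - radius):x + radius + 1] = value
    return out

def invert_to_noise(frame, alpha, rng):
    """Stand-in for DDIM inversion: blend the frame with Gaussian noise
    according to a noise-schedule coefficient alpha in [0, 1]."""
    return np.sqrt(alpha) * frame + np.sqrt(1.0 - alpha) * rng.standard_normal(frame.shape)

def track_by_generation(video, start_xy, alpha=0.7, denoise=None, seed=0):
    """Intervene on the first frame, partially invert every frame, then let
    the (stubbed) generative model re-denoise, propagating the marker."""
    rng = np.random.default_rng(seed)
    denoise = denoise or (lambda z: z)  # identity stub for the real sampler
    first = invert_to_noise(paint_marker(video[0], start_xy), alpha, rng)
    rest = [invert_to_noise(f, alpha, rng) for f in video[1:]]
    return [denoise(z) for z in [first] + rest]
```

In this sketch, `alpha` plays the role of the inversion depth: a value near 1 preserves most of the frame content (a shallow intervention), while smaller values hand more of the reconstruction over to the generative model, which is what allows the marker to be propagated by the model's own dynamics rather than by the pixels alone.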