Grounded in a real city, SWM generates videos over multi-kilometer trajectories without accumulating errors, maintaining robust performance across long-horizon generation.
SWM also allows users to reshape familiar city scenes through text prompts: summon a massive wave onto the streets, drop Godzilla between skyscrapers, or imagine any scenario you like.
SWM performs retrieval-augmented generation: given geographic coordinates, camera actions, and text prompts, it retrieves nearby street-view images and conditions generation on complementary geometric and appearance references. This anchors each generated chunk to the real layout and appearance of the location.
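The retrieval step can be pictured as a nearest-neighbor lookup over a geo-indexed street-view database. The sketch below is illustrative only: the record schema, `k`, and the distance cutoff are assumptions, not the production index.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_references(db, query_lat, query_lon, k=4, max_dist_m=50.0):
    """Return the k nearest street-view records within max_dist_m of the query.
    `db` is a list of {'lat', 'lon', 'image_path'} dicts (hypothetical schema)."""
    scored = [(haversine_m(query_lat, query_lon, rec["lat"], rec["lon"]), rec)
              for rec in db]
    scored = [(d, rec) for d, rec in scored if d <= max_dist_m]
    scored.sort(key=lambda x: x[0])
    return [rec for _, rec in scored[:k]]
```

A real system would use a spatial index (e.g., a grid or k-d tree) rather than a linear scan, but the contract is the same: coordinates in, nearby references out.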
SWM is trained on aligned pairs of street-view references and target video sequences from two sources: 1.2M real panoramic images captured across Seoul, and 10K synthetic videos from the Unreal Engine-based CARLA urban simulator spanning 431,500 m² of city area.

Cross-temporal pairing requires that reference street-view images be captured at a different time from the target, forcing the model to rely on persistent spatial structure and ignore transient objects like vehicles. As shown below (left), the trained model attends to scene geometry rather than dynamic content in the references. View interpolation (right) synthesizes smooth training videos from sparse street-view keyframes (5–20m apart) using an Intermittent Freeze-Frame strategy matched to the 3D VAE's temporal stride.
Attention Visualization under Cross-Temporal Pairing
Street-View Interpolation Model
Without cross-temporal pairing, dynamic objects in the reference (e.g., vehicles) leak into the generated video. Temporal separation forces the model to focus on persistent scene structure.
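The pairing constraint itself is simple to state in code: a reference qualifies for a target only if it is spatially close but temporally distant. A minimal sketch, assuming a hypothetical record schema (`xy` position, `time` capture timestamp) and illustrative thresholds:

```python
import math
from datetime import datetime, timedelta

def cross_temporal_pairs(targets, references, min_gap=timedelta(days=1),
                         max_dist_m=20.0):
    """For each target clip, collect references captured nearby but at a
    sufficiently different time, so transient content (parked cars,
    pedestrians) in a reference cannot match the target and leak through.
    Record schema {'id', 'xy': (x, y), 'time': datetime} is hypothetical."""
    pairs = []
    for tgt in targets:
        refs = [
            ref for ref in references
            if math.dist(tgt["xy"], ref["xy"]) <= max_dist_m     # spatially close
            and abs(tgt["time"] - ref["time"]) >= min_gap        # temporally apart
        ]
        if refs:
            pairs.append((tgt["id"], [r["id"] for r in refs]))
    return pairs
```

The temporal gap is the whole trick: anything that survives it is, by construction, persistent scene structure.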
To complement driving-only real trajectories, we render synthetic data from the CARLA simulator with three trajectory types: pedestrian (sidewalks, crossings), vehicle (highways, urban roads), and free-camera (arbitrary collision-free paths). This diversity enables SWM to handle arbitrary camera movements at inference.
Synthetic data with diverse camera paths allows SWM to generalize beyond forward-driving trajectories, as shown in the comparison below.
SWM autoregressively generates video chunks conditioned on a text prompt, camera trajectory, and street-view images retrieved from a geo-indexed street-view database.
For each generation chunk, nearby street-view images are retrieved via nearest-neighbor search and depth-based reprojection filtering. These references condition generation through two complementary pathways: geometric referencing warps the nearest reference into the target viewpoint via depth-based splatting to provide spatial layout cues, while semantic referencing injects original reference images into the transformer's latent sequence so the model can attend to appearance details across all references.
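Geometric referencing amounts to a depth-based forward warp (splat) of the nearest reference into the target camera. The sketch below is a minimal pinhole-camera version in image space, with occlusions resolved by far-to-near painting as a simple z-buffer; the actual model conditions in latent space, so treat this as illustrative only.

```python
import numpy as np

def splat_to_target(ref_img, ref_depth, K, T_ref_to_tgt):
    """Forward-warp (splat) a reference image into the target viewpoint using
    per-pixel depth. Unfilled target pixels stay zero, signaling missing layout.
    ref_img: (H, W, 3); ref_depth: (H, W) metric depth;
    K: (3, 3) intrinsics; T_ref_to_tgt: (4, 4) relative pose."""
    H, W = ref_depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).astype(float)
    # Back-project pixels to 3D points in the reference camera frame.
    pts = (np.linalg.inv(K) @ pix.T).T * ref_depth.reshape(-1, 1)
    pts = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    # Transform points into the target camera frame and project.
    cam = (T_ref_to_tgt @ pts.T).T[:, :3]
    keep = cam[:, 2] > 1e-6                      # points in front of the camera
    uv = (K @ cam[keep].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    colors, depth = ref_img.reshape(-1, 3)[keep], cam[keep, 2]
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, colors, depth = u[inb], v[inb], colors[inb], depth[inb]
    # Z-buffer via far-to-near painting: later (nearer) writes win on collisions.
    order = np.argsort(-depth)
    out = np.zeros((H, W, 3), dtype=ref_img.dtype)
    out[v[order], u[order]] = colors[order]
    return out
```

Semantic referencing needs no warp at all: the original reference images are tokenized and appended to the transformer's latent sequence, so attention can pull appearance details from any reference, not just the geometrically nearest one.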
Autoregressive generation accumulates errors over long horizons. Prior methods (a) use a static attention sink anchored to the initial frame, whose guidance weakens as the camera moves farther away. Our Virtual Lookahead (VL) Sink (b) dynamically retrieves the nearest street-view image as a virtual future destination, providing a clean, error-free anchor ahead of the current chunk. This continuously re-grounds generation, stabilizing video quality over trajectories spanning hundreds of meters. The comparison below shows that VL Sink effectively prevents quality degradation in long-horizon generation.
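The sink-selection step can be sketched as: walk a fixed distance ahead of the current chunk along the planned trajectory, then retrieve the street-view nearest to that future point. The function names, record layout, and fixed 30 m lookahead below are assumptions for illustration, not the paper's exact parameters.

```python
import math

def select_lookahead_sink(db_positions, trajectory, chunk_end_idx,
                          lookahead_m=30.0):
    """Pick the index of the street-view nearest to a point `lookahead_m`
    ahead of the current chunk along the trajectory, to serve as the clean
    attention-sink anchor for the next chunk. Positions are (x, y) tuples."""
    # Walk forward along the remaining trajectory until lookahead_m is covered.
    travelled, target = 0.0, trajectory[-1]
    for a, b in zip(trajectory[chunk_end_idx:], trajectory[chunk_end_idx + 1:]):
        step = math.dist(a, b)
        if travelled + step >= lookahead_m:
            t = (lookahead_m - travelled) / step   # interpolate within segment
            target = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            break
        travelled += step
    # Retrieve the database street-view nearest to that future point.
    return min(range(len(db_positions)),
               key=lambda i: math.dist(db_positions[i], target))
```

Because the sink is re-selected every chunk from real captures, it never inherits the accumulated artifacts of previously generated frames, which is what makes it an error-free anchor.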
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges: temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity from vehicle-mounted captures, and data sparsity from the wide spacing between capture locations. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
[TBD]
This work is an industry-academic collaboration between NAVER and KAIST, utilizing NAVER Map data.
We would like to sincerely thank everyone at NAVER and NAVER Cloud who contributed their time, expertise, and feedback throughout this work. We are grateful to Jinbae Im, Moonbin Yim, and Bado Lee for their help with data preprocessing. We also thank Jongchae Na, Hyunjoon Cho, Hochul Hwang, MyoungSuk Chae, and Kwangkean Kim for their support with data processing and guidance on the use of map data and metadata. Finally, we thank Jonghak Kim, Jieun Shin, and Hyeeun Shin for valuable discussions and thoughtful feedback. Their support and collaboration are greatly appreciated.