Grounded in a real city, SWM generates videos over multi-kilometer trajectories without accumulating errors, maintaining robust performance across long-horizon generation.
SWM also allows users to reshape familiar city scenes through text prompts: summon a massive wave onto the streets, drop Godzilla between skyscrapers, or imagine any scenario you like.
SWM performs retrieval-augmented generation: given geographic coordinates, camera actions, and text prompts, it retrieves nearby street-view images and conditions generation on complementary geometric and appearance references. This anchors each generated chunk to the real layout and appearance of the location.
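The retrieval step can be pictured as a nearest-neighbor lookup over a geo-indexed street-view database. The sketch below is illustrative only: the record schema, `k`, and the distance cutoff are assumptions, not the production index.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_references(db, query_lat, query_lon, k=4, max_dist_m=50.0):
    """Return the k nearest street-view records within max_dist_m of the query.
    `db` is a list of {'lat', 'lon', 'image_path'} dicts (hypothetical schema)."""
    scored = [(haversine_m(query_lat, query_lon, rec["lat"], rec["lon"]), rec)
              for rec in db]
    scored = [(d, rec) for d, rec in scored if d <= max_dist_m]
    scored.sort(key=lambda x: x[0])
    return [rec for _, rec in scored[:k]]
```

A real system would use a spatial index (e.g., a grid or k-d tree) rather than a linear scan, but the contract is the same: coordinates in, nearby references out.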
SWM is trained on aligned pairs of street-view references and target video sequences from two sources: 1.2M real panoramic images captured across Seoul, and 10K synthetic videos from the Unreal Engine-based CARLA urban simulator spanning 431,500 m² of city area.

Cross-temporal pairing requires that reference street-view images be captured at a different time from the target, forcing the model to rely on persistent spatial structure and ignore transient objects like vehicles. As shown below (left), the trained model attends to scene geometry rather than dynamic content in the references. View interpolation (right) synthesizes smooth training videos from sparse street-view keyframes (5–20m apart) using an Intermittent Freeze-Frame strategy matched to the 3D VAE's temporal stride.
Attention Visualization under Cross-Temporal Pairing
Street-View Interpolation Model
Without cross-temporal pairing, dynamic objects in the reference (e.g., vehicles) leak into the generated video. Temporal separation forces the model to focus on persistent scene structure.
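The pairing constraint itself is simple to state in code: a reference qualifies for a target only if it is spatially close but temporally distant. A minimal sketch, assuming a hypothetical record schema (`xy` position, `time` capture timestamp) and illustrative thresholds:

```python
import math
from datetime import datetime, timedelta

def cross_temporal_pairs(targets, references, min_gap=timedelta(days=1),
                         max_dist_m=20.0):
    """For each target clip, collect references captured nearby but at a
    sufficiently different time, so transient content (parked cars,
    pedestrians) in a reference cannot match the target and leak through.
    Record schema {'id', 'xy': (x, y), 'time': datetime} is hypothetical."""
    pairs = []
    for tgt in targets:
        refs = [
            ref for ref in references
            if math.dist(tgt["xy"], ref["xy"]) <= max_dist_m     # spatially close
            and abs(tgt["time"] - ref["time"]) >= min_gap        # temporally apart
        ]
        if refs:
            pairs.append((tgt["id"], [r["id"] for r in refs]))
    return pairs
```

The temporal gap is the whole trick: anything that survives it is, by construction, persistent scene structure.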
To complement driving-only real trajectories, we render synthetic data from the CARLA simulator with three trajectory types: pedestrian (sidewalks, crossings), vehicle (highways, urban roads), and free-camera (arbitrary collision-free paths). This diversity enables SWM to handle arbitrary camera movements at inference.
Synthetic data with diverse camera paths allows SWM to generalize beyond forward-driving trajectories, as shown in the comparison below.
SWM autoregressively generates video chunks conditioned on a text prompt, camera trajectory, and street-view images retrieved from a geo-indexed street-view database.
For each generation chunk, nearby street-view images are retrieved via nearest-neighbor search and depth-based reprojection filtering. These references condition generation through two complementary pathways: geometric referencing warps the nearest reference into the target viewpoint via depth-based splatting to provide spatial layout cues, while semantic referencing injects original reference images into the transformer's latent sequence so the model can attend to appearance details across all references.
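Geometric referencing amounts to a depth-based forward warp (splat) of the nearest reference into the target camera. The sketch below is a minimal pinhole-camera version in image space, with occlusions resolved by far-to-near painting as a simple z-buffer; the actual model conditions in latent space, so treat this as illustrative only.

```python
import numpy as np

def splat_to_target(ref_img, ref_depth, K, T_ref_to_tgt):
    """Forward-warp (splat) a reference image into the target viewpoint using
    per-pixel depth. Unfilled target pixels stay zero, signaling missing layout.
    ref_img: (H, W, 3); ref_depth: (H, W) metric depth;
    K: (3, 3) intrinsics; T_ref_to_tgt: (4, 4) relative pose."""
    H, W = ref_depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).astype(float)
    # Back-project pixels to 3D points in the reference camera frame.
    pts = (np.linalg.inv(K) @ pix.T).T * ref_depth.reshape(-1, 1)
    pts = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    # Transform points into the target camera frame and project.
    cam = (T_ref_to_tgt @ pts.T).T[:, :3]
    keep = cam[:, 2] > 1e-6                      # points in front of the camera
    uv = (K @ cam[keep].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    colors, depth = ref_img.reshape(-1, 3)[keep], cam[keep, 2]
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, colors, depth = u[inb], v[inb], colors[inb], depth[inb]
    # Z-buffer via far-to-near painting: later (nearer) writes win on collisions.
    order = np.argsort(-depth)
    out = np.zeros((H, W, 3), dtype=ref_img.dtype)
    out[v[order], u[order]] = colors[order]
    return out
```

Semantic referencing needs no warp at all: the original reference images are tokenized and appended to the transformer's latent sequence, so attention can pull appearance details from any reference, not just the geometrically nearest one.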
Autoregressive generation accumulates errors over long horizons. Prior methods (a) use a static attention sink anchored to the initial frame, whose guidance weakens as the camera moves farther away. Our Virtual Lookahead (VL) Sink (b) dynamically retrieves the nearest street-view image as a virtual future destination, providing a clean, error-free anchor ahead of the current chunk. This continuously re-grounds generation, stabilizing video quality over trajectories spanning hundreds of meters. The comparison below shows that VL Sink effectively prevents quality degradation in long-horizon generation.
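The sink-selection step can be sketched as: walk a fixed distance ahead of the current chunk along the planned trajectory, then retrieve the street-view nearest to that future point. The function names, record layout, and fixed 30 m lookahead below are assumptions for illustration, not the paper's exact parameters.

```python
import math

def select_lookahead_sink(db_positions, trajectory, chunk_end_idx,
                          lookahead_m=30.0):
    """Pick the index of the street-view nearest to a point `lookahead_m`
    ahead of the current chunk along the trajectory, to serve as the clean
    attention-sink anchor for the next chunk. Positions are (x, y) tuples."""
    # Walk forward along the remaining trajectory until lookahead_m is covered.
    travelled, target = 0.0, trajectory[-1]
    for a, b in zip(trajectory[chunk_end_idx:], trajectory[chunk_end_idx + 1:]):
        step = math.dist(a, b)
        if travelled + step >= lookahead_m:
            t = (lookahead_m - travelled) / step   # interpolate within segment
            target = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            break
        travelled += step
    # Retrieve the database street-view nearest to that future point.
    return min(range(len(db_positions)),
               key=lambda i: math.dist(db_positions[i], target))
```

Because the sink is re-selected every chunk from real captures, it never inherits the accumulated artifacts of previously generated frames, which is what makes it an error-free anchor.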
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges: temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity from vehicle-mounted captures, and data sparsity from the wide spacing between capture locations. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
[TBD]
This work is an industry-academic collaboration between NAVER and KAIST, utilizing NAVER Map data.
We would like to sincerely thank everyone at NAVER and NAVER Cloud who contributed their time, expertise, and feedback throughout this work. We are grateful to Jinbae Im, Moonbin Yim, and Bado Lee for their help with data preprocessing. We also thank Jongchae Na, Hyunjoon Cho, Hochul Hwang, MyoungSuk Chae, and Kwangkean Kim for their support with data processing and guidance on the use of map data and metadata. Finally, we thank Jonghak Kim, Jieun Shin, and Hyeeun Shin for valuable discussions and thoughtful feedback. Their support and collaboration are greatly appreciated.