There’s no doubt they’re technically impressive, but what does one do with them?
It is inevitable that learned simulators will replace hand-coded simulators, as it is a straightforward application of the Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
By enabling general purpose robotics, world models will be one of the most useful inventions of all time. For examples of what I'm talking about in current research, check:
Dreamer 4: https://danijar.com/project/dreamer4/
DreamDojo: https://arxiv.org/abs/2602.06949
Tesla's world model: https://www.youtube.com/watch?v=LFh9GAzHg1c
Waymo's world model: https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...
This one is probably too small to be useful for that, and not diverse enough? But I could be wrong.
However, there are a few promising markets, assuming WMs continue to get better and cheaper:
1. Robotics training / evaluation: modern end-to-end (sensors-to-control) robot policies require simulators that are almost indistinguishable from reality. If your sim is distinguishable from reality, the evaluation metrics you get from sim don't mean anything and the policies you train in sim don't work. World models will likely be the highest-fidelity robotics simulators, since WMs are data-driven and get arbitrarily more realistic given more data/compute. This is why so many robotics companies have WM projects [1] [2] [3] [4]; a rough sketch of what this looks like follows the links below.
2. Video frontends for agents: in the same way that today's frontier labs are building realtime voice interfaces [5] which behave like a phone call, realtime video interfaces will behave like a video call. Early forms of this don't feel compelling IMO [6] [7], but once the models can instantly blend between rendering the agent itself, drawing diagrams/visualizations, rendering video, etc. I can see it surpassing pure voice mode.
3. Entertainment: zero-shot world generation (i.e. holodeck, genie 3; paste in an image/video/text prompt and get a world) will be a fun toy but I'm not convinced it has any long-term value. I'm more optimistic about proper narrative experiences where each scene/level is a small, carefully-crafted world (behaving like a normal film scene if you don't touch the controls, and an uncharted/TLoU-style narrative game if you do), such that the sequence of scenes builds up a larger story.
[1] https://wayve.ai/thinking/gaia-3/
[2] https://xcancel.com/Tesla/status/1982255564974641628 / https://xcancel.com/ProfKuang/status/1996642397204394179
[3] https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...
[4] https://www.1x.tech/discover/world-model-self-learning
[5] https://thinkingmachines.ai/blog/interaction-models/
[6] https://runwayml.com/news/introducing-runway-characters
[7] https://blog.character.ai/character-ais-real-time-video-brea...
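To make point 1 a bit more concrete, here is a rough sketch of what "training/evaluating a policy in a world model" means mechanically. Everything in it (the WorldModel interface, the reward head, the placeholder dynamics) is hypothetical and only illustrates the control flow, not any lab's actual system:

```python
# Hypothetical sketch: evaluating a robot policy inside a learned simulator.
# The policy never touches a hand-coded physics engine, only a world model
# that predicts the next observation from the current one plus an action.

import numpy as np


class WorldModel:
    """Stand-in for a learned world model; a real one would be a neural net."""

    def reset(self, prompt_frames: np.ndarray) -> np.ndarray:
        # Condition on a few real frames, return the first predicted observation.
        return prompt_frames[-1]

    def step(self, obs: np.ndarray, action: np.ndarray) -> tuple[np.ndarray, float, bool]:
        # Predict the next observation; reward/termination would come from a
        # learned reward head or a separate success detector.
        next_obs = obs  # placeholder dynamics
        reward, done = 0.0, False
        return next_obs, reward, done


def evaluate(policy, model: WorldModel, prompt_frames: np.ndarray, horizon: int = 200) -> float:
    """Roll the policy out inside the world model and report the return."""
    obs, total = model.reset(prompt_frames), 0.0
    for _ in range(horizon):
        action = policy(obs)
        obs, reward, done = model.step(obs, action)
        total += reward
        if done:
            break
    return total
```

The point is that the fidelity of step() is purely a function of data and compute, which is exactly the Bitter Lesson argument above.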
Imagine playing Red Dead Redemption 2: you attempt to ride your horse from Saint Denis to Valentine, and Valentine no longer exists, or is a completely different town located half a mile off from where it was originally.
I just don't see how this would work...
You could also use these models to generate assets for a game during development, whether that's simple cutscenes or assets produced through Gaussian splatting or some other process.
If these models and others can be run cost-effectively on a cloud service, or even locally at some point, then you could do some interesting things by combining them with 3D mesh generation, img2img, vid2vid, etc. Just think about even simple games like Papers, Please and the whole genre it spawned, built on short episodes where you have to make a guess based on what you see; there's a lot of potential for creating new mechanics around generative imagery.
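As a toy illustration of that last idea, here's roughly what the core loop of a Papers, Please-style round built on generated imagery could look like. The generate_document call is a stand-in for whatever image or world model you'd actually use; none of the names here reflect a real API:

```python
# Rough sketch (all names hypothetical): each round asks a generative model for
# a "document" with a known hidden label, shows it, and scores the player's guess.

import random


def generate_document(is_forged: bool) -> str:
    # Placeholder for a call to an image/video model (img2img, a world model, etc.).
    # A real game would send a prompt describing the forgery and get pixels back.
    return f"<generated document image, forged={is_forged}>"


def play_round() -> bool:
    is_forged = random.random() < 0.5
    print("Inspect this document:")
    print(generate_document(is_forged))
    guess = input("Forged? (y/n): ").strip().lower() == "y"
    correct = guess == is_forged
    print("Correct!" if correct else "Wrong.")
    return correct


if __name__ == "__main__":
    score = sum(play_round() for _ in range(5))
    print(f"Score: {score}/5")
```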
Remember video generation? Three years ago the Will Smith spaghetti video came out.
You see how this trend will only continue? Game development is going to get really weird.
> A dedicated 17B long-video refiner sharpens texture, motion, and late-window quality on top of the long-rollout backbone.
In this case, what looks interesting is the one-minute coherence and the massive speedup: they claim 36x over open models with similar capabilities. You can tell they aren't aiming for state-of-the-art visuals; the output quality looks very SD 1.5.
I can't say I'm looking forward to an AI video future.
I'm curious if a younger me would have adapted much faster.
Seedance 2.0 and Kling 3 are regarded as the best closed-source video models we have. I've subscribed to a few AI video subreddits, and the consensus at the moment is that they're good for anything but long-form videos with humans.
No surprise that we're very good at spotting even the most subtle differences when looking at other people.
I've been doing some content with people at https://industrialallusions.com
https://www.reddit.com/r/HiggsfieldAI/
Higgsfield has multiple models available; people usually use Kling 2.5 & 3. There are a few good examples posted right now where you'll notice the subtle differences.
I have tried to generate things myself, and it's extremely hard to get more than 7-8 clips that are consistent; eventually you'll accept a compromise. I think that's why there isn't any long-form content being done yet. Getting good results is sometimes just "chance", regardless of how much reference data you have.