There’s no doubt they’re technically impressive, but what does one do with them?
It is inevitable that learned simulators will replace hand-coded simulators, as it is a straightforward application of the Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
By enabling general purpose robotics, world models will be one of the most useful inventions of all time. For examples of what I'm talking about in current research, check:
Dreamer 4: https://danijar.com/project/dreamer4/
DreamDojo: https://arxiv.org/abs/2602.06949
Tesla's world model: https://www.youtube.com/watch?v=LFh9GAzHg1c
Waymo's world model: https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...
This one is probably too small to be useful for that, and not diverse enough? But I could be wrong.
However, there are a few promising markets, assuming WMs continue to get better and cheaper:
1. Robotics training / evaluation: modern end-to-end (sensors-to-control) robot policies require simulators that are almost indistinguishable from reality. If your sim is distinguishable from reality, the evaluation metrics you get from sim don't mean anything and the policies you train in sim don't work. World models will likely be the highest-fidelity robotics simulators, since WMs are data-driven and get arbitrarily more realistic given more data/compute. This is why so many robotics companies have WM projects [1] [2] [3] [4]; a rough sketch of what this looks like follows the links below.
2. Video frontends for agents: in the same way that today's frontier labs are building realtime voice interfaces [5] which behave like a phone call, realtime video interfaces will behave like a video call. Early forms of this don't feel compelling IMO [6] [7], but once the models can instantly blend between rendering the agent itself, drawing diagrams/visualizations, rendering video, etc. I can see it surpassing pure voice mode.
3. Entertainment: zero-shot world generation (i.e. holodeck, genie 3; paste in an image/video/text prompt and get a world) will be a fun toy but I'm not convinced it has any long-term value. I'm more optimistic about proper narrative experiences where each scene/level is a small, carefully-crafted world (behaving like a normal film scene if you don't touch the controls, and an uncharted/TLoU-style narrative game if you do), such that the sequence of scenes builds up a larger story.
[1] https://wayve.ai/thinking/gaia-3/
[2] https://xcancel.com/Tesla/status/1982255564974641628 / https://xcancel.com/ProfKuang/status/1996642397204394179
[3] https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...
[4] https://www.1x.tech/discover/world-model-self-learning
[5] https://thinkingmachines.ai/blog/interaction-models/
[6] https://runwayml.com/news/introducing-runway-characters
[7] https://blog.character.ai/character-ais-real-time-video-brea...
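To make point 1 a bit more concrete, here is a rough sketch of what "training/evaluating a policy in a world model" means mechanically. Everything in it (the WorldModel interface, the reward head, the placeholder dynamics) is hypothetical and only illustrates the control flow, not any lab's actual system:

```python
# Hypothetical sketch: evaluating a robot policy inside a learned simulator.
# The policy never touches a hand-coded physics engine, only a world model
# that predicts the next observation from the current one plus an action.

import numpy as np


class WorldModel:
    """Stand-in for a learned world model; a real one would be a neural net."""

    def reset(self, prompt_frames: np.ndarray) -> np.ndarray:
        # Condition on a few real frames, return the first predicted observation.
        return prompt_frames[-1]

    def step(self, obs: np.ndarray, action: np.ndarray) -> tuple[np.ndarray, float, bool]:
        # Predict the next observation; reward/termination would come from a
        # learned reward head or a separate success detector.
        next_obs = obs  # placeholder dynamics
        reward, done = 0.0, False
        return next_obs, reward, done


def evaluate(policy, model: WorldModel, prompt_frames: np.ndarray, horizon: int = 200) -> float:
    """Roll the policy out inside the world model and report the return."""
    obs, total = model.reset(prompt_frames), 0.0
    for _ in range(horizon):
        action = policy(obs)
        obs, reward, done = model.step(obs, action)
        total += reward
        if done:
            break
    return total
```

The point is that the fidelity of step() is purely a function of data and compute, which is exactly the Bitter Lesson argument above.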
Imagine playing Red Dead Redemption 2: you attempt to ride your horse from Saint Denis to Valentine, and Valentine no longer exists, or is a completely different town located half a mile off from where it was originally.
I just don't see how this would work...
You could also use these models to generate assets for a game during development, whether that's simple cutscenes or assets produced through Gaussian splatting or some other process.
If these models and others can be run cost-effectively on a cloud service, or even locally at some point, then you could do some interesting things by combining them with 3D mesh generation, img2img, vid2vid, etc. Just think about even simple games like Papers, Please and the whole genre it spawned, built on short episodes where you have to make a guess based on what you see; there's a lot of potential for creating new mechanics around generative imagery.
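As a toy illustration of that last idea, here's roughly what the core loop of a Papers, Please-style round built on generated imagery could look like. The generate_document call is a stand-in for whatever image or world model you'd actually use; none of the names here reflect a real API:

```python
# Rough sketch (all names hypothetical): each round asks a generative model for
# a "document" with a known hidden label, shows it, and scores the player's guess.

import random


def generate_document(is_forged: bool) -> str:
    # Placeholder for a call to an image/video model (img2img, a world model, etc.).
    # A real game would send a prompt describing the forgery and get pixels back.
    return f"<generated document image, forged={is_forged}>"


def play_round() -> bool:
    is_forged = random.random() < 0.5
    print("Inspect this document:")
    print(generate_document(is_forged))
    guess = input("Forged? (y/n): ").strip().lower() == "y"
    correct = guess == is_forged
    print("Correct!" if correct else "Wrong.")
    return correct


if __name__ == "__main__":
    score = sum(play_round() for _ in range(5))
    print(f"Score: {score}/5")
```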
Remember video generation? Three years ago the Will Smith spaghetti video came out.
You see how this trend will only continue? Game development is going to get really weird.
> A dedicated 17B long-video refiner sharpens texture, motion, and late-window quality on top of the long-rollout backbone.
In this case, what looks interesting is the one-minute coherence and the massive speedup: they claim 36x over open models with similar capabilities. You can tell they aren't aiming for state-of-the-art visuals; the output quality looks very SD 1.5.
I can't say I'm looking forward to an AI video future.
I'm curious if a younger me would have adapted much faster.
Seedance 2.0 and Kling 3 are regarded as the best closed-source video models we have. I've subscribed to a few AI video subreddits, and the consensus at the moment is that they're good for anything but long-form videos with humans.
No surprise that we're very good at spotting even the most subtle differences when looking at other people.
I've been doing some content with people at https://industrialallusions.com
https://www.reddit.com/r/HiggsfieldAI/
Higgsfield has multiple models available; people usually use Kling 2.5 & 3. There are a few good examples posted right now where you'll notice the subtle differences.
I have tried to generate things myself, and it's extremely hard to get more than 7-8 clips that are consistent; eventually you'll accept a compromise. I think that's why there isn't any long-form content being done yet. Getting good results is sometimes just "chance", regardless of how much reference data you have.