Posted by mingtianzhang 9/3/2025

Voyager – An interactive video generation model with realtime 3D reconstruction (github.com)
322 points | 225 comments
geokon 9/3/2025|
Seems like the kind of thing Street View data would have been perfect to train on.

I wonder if you could loop back the last frame of each video to extend the generated world further, creating a kind of AI fever dream.
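
Roughly the loop I have in mind, sketched below. Assumption: `generate_clip` stands in for whatever image-to-video call the model actually exposes; it isn't Voyager's real API.

    from typing import Callable, List
    from PIL import Image

    def extend_world(
        seed: Image.Image,
        generate_clip: Callable[[Image.Image], List[Image.Image]],  # assumed image-to-video call
        segments: int = 5,
    ) -> List[Image.Image]:
        """Chain short generated clips by re-conditioning on the last frame of each."""
        frames: List[Image.Image] = []
        current = seed
        for _ in range(segments):
            clip = generate_clip(current)  # one generated video segment
            frames.extend(clip)
            current = clip[-1]             # loop the last frame back in as the new seed image
        return frames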

kridsdale1 9/3/2025|
Why the past tense? Google is holding on to all of that, going back years.
Cthulhu_ 9/3/2025||
Yeah, they have all the raw data (Google is a self-confessed data hoarder, after all); I'm sure they have research projects where they use AI and similar techniques to stitch Street View images together.

I also wouldn't be surprised if their Street View cars / people record video instead of stills these days. Assuming they started capturing stuff in 2007 (and it was probably a lot earlier), storage technology has improved at least tenfold since then (probably more), and video processing along with it.

amelius 9/3/2025||
Can I use this to replace a LiDAR?
ENGNR 9/3/2025||
Depends how many liberties it takes in imagining the world

Lidar is direct measurement

incone123 9/3/2025|||
It's generating a 3D world from a photo or other image, rather than giving you a 3D model of the real world.
amelius 9/3/2025||
Look at the examples. It can generate a depth map.
incone123 9/3/2025|||
Yes, which may be fine or not depending on the end goal of the person I replied to. Some applications need the certainty of LiDAR and others can tolerate it if the model makes some mistakes.
gs17 9/3/2025|||
Yes, but that's not a novel part of this. We've been able to do that for a while (a long while if you count binocular or time-of-flight vision systems).
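
For reference, a single-image depth map is an off-the-shelf capability at this point; a minimal sketch using the publicly released MiDaS model via torch.hub (the input file name is a placeholder):

    import cv2
    import torch

    # Load a small pretrained monocular depth model plus its matching preprocessing.
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    midas.eval()
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

    img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image path
    with torch.no_grad():
        prediction = midas(transform(img))  # relative (inverse) depth, one value per pixel
    depth = prediction.squeeze().cpu().numpy()
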
forrestthewoods 9/3/2025|||
… absolutely not. No, this sentence doesn't even make sense.

Good grief, orange site, sometimes I swear.

garbthetill 9/3/2025||
if it does, then Elon really won the no-lidar bet
odie5533 9/3/2025|||
All he had to do was remove the LIDAR and wait 15-20 years for the tech to catch up. I'm sure Tesla owners don't mind waiting. They're used to it by now.
HeWhoLurksLate 9/3/2025||||
There's a huge difference, in how much you should trust a feature, between "feature that mostly works and is kinda neat" and "5,000-pound robot relies on this working all the time or people will probably get hurt, at minimum".

Doesn't really matter if an imgTo3d script gets a face's depth map inverted; it's kinda problematic if your car doesn't think there's something where there actually is.

Cthulhu_ 9/3/2025|||
I wasn't aware there was a competition or a bet.
user_7832 9/3/2025||
I see a lot of skeptical folks here... isn't this the first such model? I remember seeing a lot of image to 3d models before, but they'd all produce absurd results in a few moments. This seems to produce really good output in comparison.
explorigin 9/3/2025||
If you click on the link, they show a comparison chart with other similar models.
neuronic 9/3/2025||
> isn't this the first such model?

The linked Github page has a comparison with other world models...

londons_explore 9/3/2025||
> The minimum GPU memory required is 60GB for 540p.

We're about to see next-gen games requiring these as minimum system requirements...

krystofee 9/3/2025||
I think it's only a matter of time until we have photorealistic, playable computer games generated by these engines.
Keyframe 9/3/2025||
There's a reason Tencent is doing this https://en.wikipedia.org/wiki/Tencent#Foreign_studio_assets
netsharc 9/3/2025|||
Yeah, MS Flight Simulator with a world that's "inspired by" ours... The original 2020 version had issues with things like the Sydney Harbour Bridge (did it have the Opera House?). Using AI to generate 3D models of these things based on pictures would be crazy (of course they'd generate once, on first request).

So if you're the first to approach the Opera House, it would ask the engine for 3D models of the area, and it would query its image database, see the fancy opera house, and generate its own interpretation. If there's no data (e.g. a landscape in the middle of Africa), it'd use the satellite image plus typical fauna of the region.

gadders 9/3/2025||
And hopefully AI-powered NPCs to fight against/interact with.
Cthulhu_ 9/3/2025||
I believe there are games that have that already. My concern is that it's all going to be same-ish slop. Read ten AI-generated stories and you've read them all.

It could work, but they would have to both write unique prompts for each NPC (instead of "generate me 100 NPC personality prompts") and limit the possible interactions and behaviours.

But emergent / generative behaviour would be interesting, up to a point. There are plenty of roguelikes / roguelites this could work in, given their generative behaviours.

gadders 9/3/2025||
I guess for combat you would want ones that could sensibly work together and adapt, possibly with different levels of aggression, stealth, etc. Even something as good as FEAR would be something.
indiantinker 9/3/2025||
Matrix
pbd 9/3/2025||
This is genuinely exciting.
bglazer 9/3/2025|
Please don’t post chatgpt output
SirHackalot 9/3/2025||
> Minimum: The minimum GPU memory required is 60GB for 540p.

Cool, I guess… if you have tens of thousands of dollars to drop on a GPU, for output that's definitely not usable in any 3D project out of the box.

kittoes 9/3/2025||
https://www.amd.com/en/products/accelerators/radeon-pro/amd-...

It's more approachable than one might think; you can currently find two of these for less than 1,000 USD.

esafak 9/3/2025||
How much of a performance penalty is there for doubling up? What about 4x?
kittoes 9/3/2025||
I just found out about these last week and haven't received the hardware yet, so I can't give you real numbers. That said, one can probably expect at least a 10-30% penalty when the cards need to communicate with one another. Other workloads that don't require constant communication between cards can actually expect a performance boost. Your mileage will vary.
HPsquared 9/3/2025|||
I assume it can be split between multiple GPUs, like LLMs can. Or hire an H100 for like $3/hr.
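
The usual trick, sketched with plain PyTorch below (pipeline-style: first half of the network on one card, the rest on the other). This is the general idea, not Voyager's actual loading code; it also shows where the inter-card penalty comes from.

    import torch
    import torch.nn as nn

    class TwoGPUNet(nn.Module):
        """Toy model split across two GPUs: different layers live on different devices."""
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.GELU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            return self.part2(x.to("cuda:1"))  # activations hop between cards once per forward pass

    net = TwoGPUNet()
    out = net(torch.randn(8, 1024))  # requires two visible CUDA devices
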
y-curious 9/3/2025||
I mean, still awesome that it's OSS. Can probably just rent GPU time online for this
mingtianzhang 9/3/2025|
What's your opinion on modeling the world? Some people think the world is 3D, so we need to model the 3D world. Others think that since human perception is 2D, we can just model 2D views rather than the underlying 3D world, since we don't have enough 3D data to capture the world but we do have many 2D views.

Fixed question: Thanks a lot for the feedback that human perception is not 2D. Let me rephrase the question: since all the visual data we see on computers can be represented as 2D images (indexed by time, angle, etc.), and we have many such 2D datasets, do we still need to explicitly model the underlying 3D world?

AIPedant 9/3/2025||
Human perception is not 2D, touch and proprioception[1] are three-dimensional senses.

And of course it really makes more sense to say human perception is 3+1-dimensional since we perceive the passage of time.

[1] https://en.wikipedia.org/wiki/Proprioception

WithinReason 9/3/2025||
the sensors are 2D
soulofmischief 9/3/2025|||
Two of them, giving us stereo vision. We are provided visual cues that encode depth. The ideal world model would at least have this. A world model for a video game on a monitor might be able to get away with no depth information, but a) normal engines do have this information and it would make sense to provide as much data to a general model as possible, and b) the models wouldn't work on AR/VR. Training on stereo captures seems like a win all around.
WithinReason 9/3/2025||
> We are provided visual cues that encode depth. The ideal world model would at least have this.

None of these world models have explicit concepts of depth or 3D structure, and adding it would go against the principle of the Bitter Lesson. Even with 2 stereo captures there is no explicit 3D structure.

soulofmischief 9/3/2025||
Increasing the fidelity and richness of training data does not go against the bitter lesson.

The model can learn 3D representation on its own from stereo captures, but there is still richer, more connected data to learn from with stereo captures vs monocular captures. This is unarguable.

You're needlessly making things harder by forcing the model to also learn to estimate depth from monocular images, and robbing it of a channel for error-correction in the case of faulty real-world data.
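
To make the stereo point concrete: even classical block matching recovers depth from a pair of plain 2D captures, with no explicit 3D supervision involved. A minimal OpenCV sketch (file names are placeholders):

    import cv2
    import numpy as np

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)   # placeholder stereo pair
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

    # Depth is inversely proportional to disparity: depth = focal_length * baseline / disparity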

WithinReason 9/3/2025||
Stereo images have no explicit 3D information and are just 2D sensor data. But even if you wanted to use stereo data, you would restrict yourself to stereo datasets and wouldn't be able to use 99.9% of video data out there to train on which wasn't captured in stereo, that's the part that's against the Bitter Lesson.
soulofmischief 9/3/2025|||
You don't have to restrict yourself to that, you can create synthetic data or just train on both kinds of data.

I still don't understand what the bitter lesson has to do with this. First of all, it's only a piece of writing, not dogma; second, it concerns itself with algorithms and model structure, and increasing the amount of data available to train on does not conflict with it.

reactordev 9/3/2025||||
Incorrect. My sense of touch can be activated in three dimensions by placing my hand near a heat source, which radiates in three dimensions.
Nevermark 9/3/2025||
You are still sensing heat across 2 dimensions of skin.

The 3rd dimension gets inferred from that data.

(Unless you have a supernatural sensory aura!)

AIPedant 9/3/2025||
The point is that knowing where your hand is in space relative to the rest of your body is a distinct sense which is directly three-dimensional. This information is not inferred, it is measured with receptors in your joints and ligaments.
Nevermark 9/3/2025||
No, it is inferred.

You are inferring 3D positions based on many sensory signals combined.

From mechanoreceptors and proprioceptors located in our skin, joints, and muscles.

We don’t have 3-element position sensors, nor do we have 3D sensor volumes, in terms of how information is transferred to the brain, which arrives primarily in 1D (audio) or 2D (sensory surface) layouts.

From that we learn a sense of how our body is arranged very early in life.

EDIT: I was wrong about one thing. Muscle nerve endings are distributed throughout the muscle volume. So 3D positioning is still not sensed directly, but we do have sensor locations distributed in rough and malleable 3D topologies.

Those don’t give us any direct 3D positioning. In fact, we are notoriously bad at knowing which individual muscles we are using, much less which feelings correspond to which 3D coordinates within each specific muscle. But we do learn to identify anatomical locations and then infer positioning from all that information.

reactordev 9/3/2025||
Your analysis is incorrect again. Having sensors spread out across a volume is, by definition, measuring 3D space. It’s a volume. Not a surface. Humans are actually really good at knowing which muscles we are using. It’s called body sculpting. Lifting. Body building. And all of that. So nice try.
Nevermark 9/4/2025||
Ah good point. 3D in terms of anatomy, yes.

Then the mapping of those sensors to the body's anatomical state in 3D space is learned.

A surprising number of kinds of dimension are involved in categorizing sensors.

reactordev 9/5/2025||
Agreed :)

It doesn’t make it any less 3D, though. It’s the additive sensing of all sensors within a region that gives you that perception. Fascinating stuff.

echelon 9/3/2025||||
The GPCRs [1] that do most of our sense signalling are each individually complicated machines.

Many of our signals are "on" and are instead suppressed by detection. Ligand binding, suppression, the signalling cascade, all sorts of encoding, ...

In any case, when all of our senses are integrated, we have rich n-dimensional input.

- stereo vision for depth

- monocular vision optics cues (shading, parallax, etc.)

- proprioception

- vestibular sensing

- binaural hearing

- time

I would not say that we sense in three dimensions. It's much more.

[1] https://en.m.wikipedia.org/wiki/G_protein-coupled_receptor

2OEH8eoCRo0 9/3/2025||||
And the brain does sensor fusion to build a 3D model that we perceive. We don't perceive in 2D.

There are other sensors as well. Is the inner ear a 2D sensor?

AIPedant 9/3/2025||
Inner ear is a great example! I mentioned in another comment that if you want to be reductive the sensors in the inner ear - the hairs themselves - are one dimensional, but the overall sense is directly three dimensional. (In a way it's six dimensional since it includes direct information about angular momentum, but I don't think it actually has six independent degrees of freedom. E.g. it might be hard to tell the difference between spinning right-side-up and upside-down with only the inner ear, you'll need additional sense information.)
AIPedant 9/3/2025|||
It is simply wrong to describe touch and proprioception receptors as 2D.

a) In a technical sense the actual receptors are 1D, not 2D. Perhaps some of them are two dimensional, but generally mechanical touch is about pressure or tension in a single direction or axis.

b) The rods and cones in your eyes are also 1D receptors but they combine to give a direct 2D image, and then higher-level processing infers depth. But touch and proprioception combine to give a direct 3D image.

Maybe you mean that the surface of the skin is two dimensional and so is touch? But the brain does not separate touch on the hand from its knowledge of where the hand is in space. Intentionally confusing this system is the basis of the "rubber hand illusion" https://en.wikipedia.org/wiki/Body_transfer_illusion

Nevermark 9/3/2025||
I think you mean 0D for individual receptors.

Point (i.e. single point/element) receptors, each encoding a single magnitude of perception.

The cochlea could be thought of as 1D: magnitude (audio volume) measured across N frequencies, so a 1D vector.

Vision and (locally) touch/pressure/heat maps would be 2D, together.

AIPedant 9/3/2025||
No, the sensors measure a continuum of force or displacement along a line or rotational axis, 1D is correct.
Nevermark 9/3/2025||
That would be a different use of dimension.

The measurement of any one of those is a 0-dimensional tensor, a single number.

But then you are right: what is being measured by that one sensor is 1-dimensional.

But all single sensors measure across a 1-dimensional variable, whether it’s linear pressure, rotation, light intensity, audio volume at one frequency, etc.

glitchc 9/3/2025|||
It's simple: Those who think that human perception is 2D are wrong.
rubzah 9/3/2025|||
It's 2D if you only have one eye.
__alexs 9/3/2025|||
It's not even 2D with one eye. You can estimate distance purely from the eye's focal point.
yeoyeo42 9/3/2025||
With one eye you have temporal parallax, depth cues (ordering of objects in your vision), lighting cues, and the relative size of objects (things further away are smaller) combined with your learned sense of their actual size, etc.
supermatt 9/3/2025|||
Nope. There are a number of monocular depth cues: https://en.wikipedia.org/wiki/Depth_perception#Monocular_cue...
imtringued 9/3/2025|||
2D models don't have object persistence, because they store information in the viewport. Back when OpenAI released their Sora teasers, they had some scenes where they did a 360° rotation and it produced a completely different backdrop.
reactordev 9/3/2025|||
You have two eyes for a reason. The world is not 2D.
b3lvedere 9/3/2025||
The world is also not just human perception.

https://theoatmeal.com/comics/mantis_shrimp

reactordev 9/5/2025||
The fucking Mike Tysons of the seas.
hambes 9/3/2025|||
you're telling me my depth perception is not creating a 3D model of the world in my brain?
kylebenzle 9/3/2025||
[dead]
KaiserPro 9/3/2025||
So a lot of text-to-"world" engines have been basically 2D, in that they create a static background and add sprites to create the illusion of 3D.

I'm not entirely convinced that this isn't one of those, and if it's not, it sure as shit was trained on one.