Posted by mingtianzhang 9/3/2025
I wonder if you could loop the last frame of each video back in to extend the generated world further, creating a kind of AI fever dream.
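Something like the sketch below, where `generate_clip` is a stand-in you'd supply for whichever video model is actually being called (purely hypothetical, just to lay out the loop):

```python
# Hypothetical sketch only: `generate_clip` is whatever image/video
# generation call you actually have; it is passed in, not a real API.
def fever_dream(generate_clip, seed_frame, prompt, rounds=10):
    """generate_clip(seed_frame, prompt) -> list of frames."""
    clips = []
    frame = seed_frame
    for _ in range(rounds):
        clip = generate_clip(frame, prompt)  # generate the next short clip
        clips.append(clip)
        frame = clip[-1]  # loop the last frame back in as the next seed
    return clips
```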
I also wouldn't be surprised if their Street View cars / people record video instead of stills these days. Assuming they started capturing stuff in 2007 (and it was probably a lot earlier), storage technology has improved at least tenfold since then (probably more), and video processing has too.
Lidar is a direct measurement.
Good grief, orange site, sometimes I swear.
It doesn't really matter if an imgTo3d script gets a face's depth map inverted; it's kind of a problem if your car doesn't think there's something where there actually is.
The linked GitHub page has a comparison with other world models...
We're about to see next gen games requiring these as minimum system requirements...
So if you're the first to approach the Opera House, it would ask the engine for 3D models of the area, and the engine would query its image database, see the fancy opera house, and generate its own interpretation. If there's no data (e.g. a landscape in the middle of Africa), it'd use the satellite image plus the typical fauna of the region.
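Very roughly, and with every name here being hypothetical (they just label the steps of that fallback):

```python
# Hypothetical outline of the fallback described above; none of these
# helpers exist, they only name the steps.
def assets_for_area(lat, lon, image_db, satellite, region_priors, generate_3d):
    photos = image_db.lookup(lat, lon)      # e.g. street-level imagery of the Opera House
    if photos:
        return generate_3d(photos)          # the engine's "own interpretation"
    tile = satellite.tile(lat, lon)         # no ground imagery: fall back to satellite
    return generate_3d([tile], hints=region_priors(lat, lon))  # plus regional typicals
```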
It could work, but they would have to both write unique prompts for each NPC (instead of "generate me 100 NPC personality prompts") and limit the possible interactions and behaviours.
But emergent / generative behaviour would be interesting up to a point. There are plenty of roguelikes / roguelites this could work in, given their generative nature.
Cool, I guess… if you have tens of thousands of dollars to drop on a GPU, for output that’s definitely not usable in any 3D project out of the box.
It's more approachable than one might think, as you can currently find two of these for less than 1,000 USD.
Revised question: thanks a lot for the feedback that human perception is not 2D. Let me rephrase: since all the visual data we see on computers can be represented as 2D images (indexed by time, angle, etc.), and we have many such 2D datasets, do we still need to explicitly model the underlying 3D world?
And of course it really makes more sense to say human perception is 3+1-dimensional since we perceive the passage of time.
None of these world models have explicit concepts of depth or 3D structure, and adding them would go against the principle of the Bitter Lesson. Even with two stereo captures there is no explicit 3D structure.
The model can learn a 3D representation on its own, but stereo captures still give it richer, more connected data to learn from than monocular captures. This is unarguable.
You're needlessly making things harder by forcing the model to also learn to estimate depth from monocular images, and robbing it of a channel for error-correction in the case of faulty real-world data.
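(For concreteness: with a rectified stereo pair, depth falls straight out of disparity via Z = f·B/d, which is exactly the signal a monocular-only model has to re-learn the hard way. A toy example with made-up numbers:)

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    # Rectified-stereo relation Z = f * B / d; real pipelines also handle
    # zero disparity, sub-pixel matching, calibration error, etc.
    return focal_px * baseline_m / disparity_px

# Illustrative numbers only: f = 700 px, baseline = 0.54 m, disparity = 30 px
print(depth_from_disparity(30, 700, 0.54))  # ~12.6 m
```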
I still don't understand what the Bitter Lesson has to do with this. First of all, it's only a piece of writing, not dogma; second, it concerns itself with algorithms and model structure, and increasing the amount of data available to train on does not conflict with it.
The 3rd dimension gets inferred from that data.
(Unless you have a supernatural sensory aura!)
You are inferring 3D positions based on many sensory signals combined.
From mechanoreceptors and proprioceptors located in our skin, joints, and muscles.
We don’t have 3-element position sensors, nor do we have 3D sensor volumes, in terms of how information is transferred to the brain, which is primarily in 1D (audio) or 2D (sensory surface) layouts.
From that we learn a sense of how our body is arranged very early in life.
EDIT: I was wrong about one thing. Muscle nerve endings are distributed throughout the muscle volume. So 3D positioning is not sensed, but we do have sensor locations distributed in rough and malleable 3D topologies.
Those don’t give us any direct 3D positioning. In fact, we are notoriously bad at knowing which individual muscles we are using, much less which feeling corresponds to which 3D coordinate within each specific muscle. But we do learn to identify anatomical locations and then infer positioning from all that information.
Then the mapping of those sensors to the body’s anatomical state in 3D space is learned.
There are a surprising number of kinds of dimension involved in categorizing sensors.
It doesn’t make it any less 3D, though. It’s the additive sensing of all sensors within a region that gives you that perception. Fascinating stuff.
Many of our signals are "on" by default and are instead suppressed by detection: ligand binding, suppression, the signalling cascade, all sorts of encoding, ...
In any case, when all of our senses are integrated, we have rich n-dimensional input.
- stereo vision for depth
- monocular optical cues (shading, parallax, etc.)
- proprioception
- vestibular sensing
- binaural hearing
- time
I would not say that we sense in three dimensions. It's much more.
[1] https://en.m.wikipedia.org/wiki/G_protein-coupled_receptor
There are other sensors as well. Is the inner ear a 2D sensor?
a) In a technical sense the actual receptors are 1D, not 2D. Perhaps some of them are two-dimensional, but generally mechanical touch is about pressure or tension along a single direction or axis.
b) The rods and cones in your eyes are also 1D receptors but they combine to give a direct 2D image, and then higher-level processing infers depth. But touch and proprioception combine to give a direct 3D image.
Maybe you mean that the surface of the skin is two-dimensional and so is touch? But the brain does not separate touch on the hand from its knowledge of where the hand is in space. Intentionally confusing this system is the basis of the "rubber hand illusion": https://en.wikipedia.org/wiki/Body_transfer_illusion
Point (i.e. single point/element) receptors, each encoding a single magnitude of perception.
The cochlea could be thought of as 1D: magnitude (audio volume) measured across N frequencies, so a 1D vector.
Vision, and (locally) touch/pressure/heat maps, would each be 2D.
The measurement of any one of those is a 0-dimensional tensor, a single number.
But then you are right: what is being measured by that one sensor is 1-dimensional.
But all single sensors measure across a 1-dimensional variable, whether it’s linear pressure, rotation, light intensity, audio volume at one frequency, etc.
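(To make the cochlea-as-a-1D-vector point above concrete, as a loose analogy rather than a model of the actual organ: a short audio window reduced to per-frequency magnitudes is literally a 1D vector indexed by frequency.)

```python
import numpy as np

# Loose analogy only: "volume across N frequencies" as a 1D vector.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

magnitudes = np.abs(np.fft.rfft(audio))                # shape (N//2 + 1,): a 1D vector
freqs = np.fft.rfftfreq(audio.size, d=1 / sample_rate)
print(freqs[np.argmax(magnitudes)])                    # ~440.0 Hz, the loudest component
```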
I'm not entirely convinced that this isn't one of those, or if it's not, it sure as shit was trained on one.