Marble: A Multimodal World Model

Posted by meetpateltech 5 hours ago

Marble: A Multimodal World Model (www.worldlabs.ai)

135 points | 25 comments

jtfrench 9 minutes ago|

I like that they distinguish between the collider mesh (lower poly) and the detailed mesh (higher poly).

As a game developer I'm looking for:

• Export low-poly triangle mesh (ideally OBJ or FBX format; something fairly generic, nothing too fancy)
• Export texture map
• Export normals
• Bonus: export the scene as "de-structured" objects (e.g. instead of a giant world mesh with everything baked into it, separate exports for foreground and background objects to make it more game engine-ready)
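
For reference, the mesh, UV, and normal items map onto a plain Wavefront OBJ file (the texture image itself is normally referenced from a separate .mtl file). A minimal sketch, assuming a hypothetical `write_obj` helper and hand-built data; this is not Marble's exporter, just an illustration of how little a generic engine-ready format needs:

```python
def write_obj(vertices, uvs, normals, faces):
    """Serialize a triangle mesh to Wavefront OBJ text.

    vertices: list of (x, y, z)
    uvs:      list of (u, v) texture coordinates
    normals:  list of (nx, ny, nz)
    faces:    list of triangles, each a tuple of three 0-based indices.
              Assumption for brevity: one index addresses the vertex,
              uv, and normal arrays alike.
    """
    lines = []
    for x, y, z in vertices:
        lines.append(f"v {x} {y} {z}")
    for u, v in uvs:
        lines.append(f"vt {u} {v}")
    for nx, ny, nz in normals:
        lines.append(f"vn {nx} {ny} {nz}")
    for tri in faces:
        # OBJ indices are 1-based; each corner is vertex/uv/normal.
        corners = " ".join(f"{i + 1}/{i + 1}/{i + 1}" for i in tri)
        lines.append(f"f {corners}")
    return "\n".join(lines) + "\n"


# A single triangle facing +Z.
obj_text = write_obj(
    vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    uvs=[(0, 0), (1, 0), (0, 1)],
    normals=[(0, 0, 1)] * 3,
    faces=[(0, 1, 2)],
)
```

Separate foreground and background objects would simply be separate calls (or separate `o` groups in one file), which is what makes the "de-structured" export attractive.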

Gaussian splats are awesome, but not critical for my current renderers. Cool to have though.

jtfrench 7 minutes ago||

Is Marble's definition of a "world model" the same as Yann LeCun's definition of a world model? And is that the same as Genie's definition of a world model?

keyle 3 hours ago||

I'm floored. Incredible work.

Also check out their interactive examples on the webapp. They're a bit rougher around the edges but show real user input/output. Arguably such examples could be pushed further toward better-quality output.

e.g. https://marble.worldlabs.ai/world/b75af78a-b040-4415-9f42-6d...

e.g. https://marble.worldlabs.ai/world/cbd8d6fb-4511-4d2c-a941-f4...

dcl 33 minutes ago||

What happens when you prompt one of these kind of models with de_dust? Will it autocomplete the rest of the map?

edit: Just tried it and it doesn't, but it does a good job of creating something like a CS map.

padolsey 26 minutes ago|

>What happens when you prompt one of these kind of models with de_dust?

Presumably de_dust2

dvrp 47 minutes ago||

Isn't this a Gaussian Splat model?

I work in AI and, to this day, I don't know what they mean by “world” in “world model”.

padolsey 19 minutes ago||

Yeh I still don't think there's a fixed definition of what a world model is or in what modality it will emerge. I'm unconvinced it will emerge as a satisfying 3D game-like first-person walkthrough.

ProofHouse 10 minutes ago||

I think absolutely it will in a year

butifnot0701 29 minutes ago||

but it sounds cool

msteffen 2 hours ago||

I understand that DeepMind is working on this too: https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...

I wonder how their approaches and results compare?

jtfrench 16 minutes ago||

From what I can tell, you can actually export a mesh in (paid) Marble, whereas I haven't seen mesh exports offered in Genie 3 yet (could be wrong though).

echelon 2 hours ago||

Genie delivers on-the-fly generated video that responds to user inputs in real time.

Marble renders a static Gaussian Splat asset (like a 3D game engine asset) that you then render in a game engine.

Marble seems useful for lots of use cases - 3D design, online games, etc. You pay the GPU cost to render once, then you can reuse it.

Genie seems revolutionary but expensive af to render and deliver to end users. You never stop paying boatloads of H100 costs (probably several H100s or TPU equivalents per user session) per second.

You could make a VRChat type game with Marble.

You could make a VRChat game with Genie, but only the billionaires could afford to play it.

To be clear, Genie does some remarkably cool things. You can prompt it, "T-Rex tap dancing by" and it'll appear animated in the world. I don't think any other system can do this. But the cost is enormous and it's why we don't have a playable demo.

When the cost of GPU compute comes down, I'm sure we'll all be streaming a Google Stadia-like experience of "games" rendered on the fly. Multiplayer, with Hollywood-grade visuals. Like playing real-time Lord of the Rings or something wild.

Interestingly, there is a model like Google Genie that is open source and available to run on your local Nvidia desktop GPU. It's called DiamondWM [1], and it's a world model trained on FPS gameplay footage. It generates a 10 fps 160x160 image you can play through. Maybe we'll develop better models and faster techniques and the dream of local world models can one day be realized.

[1] https://diamond-wm.github.io/

abixb 2 hours ago||

As someone with a barebones understanding of "world models," how does this differ from sophisticated game engines that generate three-dimensional worlds? Is it simply the adaptation of transformer architecture in generating the 3D world vs. using a static/predictable script as in game engines (learned dynamics vs. deterministic simulation mimicking 'generation')? Would love an explanation from SMEs.

whizzter 2 hours ago||

Games are still mostly polygon-based due to tooling (even Unreal's Nanite is a special variation of handling polygons). Some engines have tried voxels (Teardown; Minecraft generates polygons and would fall in the previous category as far as rendering goes) or even implicit surface models built by composing SDF-style primitives (Dreams on PlayStation and more recently unbound.io).

All of these have fairly "exact" representations, and generation techniques are also often fairly "exact" in trying to create worlds that won't break physics engines (a big part) or rendering engines. They are often hand-crafted algorithms, but nothing really stopped neural networks from being used at a higher level.

One important detail in most generation systems in games is that they are often built to be controllable so they work with game logic (think how Minecraft generates the world to include biomes, villages, etc.) or to be more or less artist-controllable.

3D scanning has often relied on point clouds, but these were heavy, full of holes, etc., and were long infeasible for direct rendering, so many methods were developed to turn them into decent polygon meshes.

NeRFs and Gaussian splatting (GS) started appearing a few years back. These are more "approximate" and totally ignore polygon generation, instead relying on quantization of the world into NN-matrix "fields" (NeRF) or fuzzy point clouds (GS). Visually these have been impressive since they manage to capture "real" images well.
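
To make "fuzzy point clouds" concrete, here is a minimal sketch of how splats blend at a single pixel. It assumes isotropic 2D Gaussians that are already projected to screen space (real Gaussian splatting uses anisotropic 3D covariances and a tiled GPU rasterizer); the `Splat` type and `shade_pixel` function are hypothetical names for illustration only:

```python
import math
from dataclasses import dataclass


@dataclass
class Splat:
    x: float        # screen-space center
    y: float
    depth: float    # distance from the camera, used for sorting
    radius: float   # standard deviation of the isotropic falloff
    opacity: float  # peak opacity in [0, 1]
    color: tuple    # (r, g, b), each in [0, 1]


def shade_pixel(px, py, splats):
    """Blend splats front to back at one pixel (alpha compositing)."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for s in sorted(splats, key=lambda s: s.depth):
        d2 = (px - s.x) ** 2 + (py - s.y) ** 2
        # Opacity falls off smoothly with distance from the splat center.
        alpha = s.opacity * math.exp(-0.5 * d2 / s.radius ** 2)
        for c in range(3):
            color[c] += transmittance * alpha * s.color[c]
        transmittance *= 1.0 - alpha
    return tuple(color)


splats = [
    Splat(0.0, 0.0, depth=2.0, radius=1.0, opacity=1.0, color=(0.0, 0.0, 1.0)),
    Splat(0.0, 0.0, depth=1.0, radius=1.0, opacity=0.5, color=(1.0, 0.0, 0.0)),
]
pixel = shade_pixel(0.0, 0.0, splats)
```

There are no triangles or surfaces anywhere in this loop, which is why going from splats to a clean polygon mesh is a separate problem.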

This system is built on GS since that probably meshed fairly well with neural network token and diffusion techniques for encoding inputs (images, texts).

They do mention mesh exports (there has been some research into polygon generation from GS).

If the system scales to huge worlds this could change game-dev, and there seems to be some aim with the control methods, but it'd probably require more control and world/asset management since you need predictability with existing things to produce in the long term (same as with code agents).

mountainriver 2 hours ago|||

The model is predicting what the state of the world would look like after a given action.

Along with entertainment, they can be used for simulation training for robots, and they allow for imagining potential trajectories.
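
Roughly, in code: a sketch under toy assumptions, where `toy_model` is a hand-written stand-in for a learned predictor and the state is just a position on a line. All names here are hypothetical:

```python
def toy_model(state, action):
    """Stand-in for a learned predictor: next_state = f(state, action)."""
    return state + 0.1 * action


def imagine(model, state, actions):
    """Roll the model forward without touching the real environment."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


def pick_plan(model, state, candidates, goal):
    """Choose the action sequence whose imagined end state is nearest the goal."""
    return min(
        candidates,
        key=lambda actions: abs(imagine(model, state, actions)[-1] - goal),
    )


plans = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0], [1.0, 0.0, 0.0]]
best = pick_plan(toy_model, 0.0, plans, goal=0.3)
```

A robot would do the same thing with images or sensor readings as the state and a neural network in place of `toy_model`.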

echelon 2 hours ago||||

Marble is not that type of world model. It generates static Gaussian Splat assets that you can render using 3D libraries.

ghayes 2 hours ago|||

Whenever I see these and play with models like this (and the demos on this page), the movement in the world always feels like a dolly zoom [0]. Things in the distance tend to stay in the distance, even as the camera moves in that direction, and only the local area changes features.

[0] https://en.wikipedia.org/wiki/Dolly_zoom
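
For anyone unfamiliar, the dolly zoom relation is simple geometry: a subject of width w fills the frame at distance d when the field of view is 2·atan(w / 2d). A small sketch (the function name is made up for illustration):

```python
import math


def dolly_zoom_fov(subject_width, distance):
    """Field of view in degrees at which a subject of the given width
    exactly spans the frame from the given camera distance."""
    return math.degrees(2.0 * math.atan(subject_width / (2.0 * distance)))


near_fov = dolly_zoom_fov(2.0, 1.0)   # camera close to the subject: wide lens
far_fov = dolly_zoom_fov(2.0, 10.0)   # camera far away: much narrower lens
```

The subject stays the same size on screen while the apparent scale of the background shifts, which is the effect these scenes remind me of.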

echelon 2 hours ago||

This "world model" is Image to Gaussian Splat. This is a static render that a web-based Gaussian Splat viewer then renders.

Other "world model"s are Image + (keyboard input) to Video or Streaming Images, that effectively function like a game engine / video hybrid.

ProofHouse 11 minutes ago||

RIP GTA6

girfan 2 hours ago|

This seems very interesting. Timely, given that Yann LeCun's vision also seems to align with world models being the next frontier: https://news.ycombinator.com/item?id=45897271

lofties 2 hours ago|

An established founder makes claims X is the new frontier. X receives hundreds of millions in funding. Other less established founders claim they are working on X too. VCs suffering from terminal FOMO pump billions more into X. X becomes the next frontier. The previous frontiers are promptly forgotten about.

whizzter 2 hours ago||

I think the terminology is a bit confusing. This seems more graphics-focused, while I suspect that a 10-year plan as mentioned by YLC probably revolves around re-architecting AI systems to be less reliant on LLM-style nets/refinements and to better understand the world in a way that isn't as prone to hallucinations.
