Waypoint-1: Real-Time Interactive Video Diffusion from Overworld

Posted by avaer 1/23/2026

Waypoint-1: Real-Time Interactive Video Diffusion from Overworld (huggingface.co)

92 points | 22 comments

ecmulli 1/23/2026|

I don't have a big enough GPU, but I was able to play around with the model using this plugin https://github.com/daydreamlive/scope-overworld via Runpod - very cool!

ericxtang 1/24/2026||

Hey - Eric from Daydream here. Thanks for linking our project! Check out this tutorial video that Vibor made - https://www.youtube.com/watch?v=cW3ZGxZNX0g

cmuir 1/24/2026||

Wow, very cool. Starring now.

roskelld 1/23/2026||

The context seemed to last only a few seconds. I started from a mock-up screenshot of a fantasy video game, complete with a first-person weapon. Then, as I moved forward, the weapon became part of the scenery and the whole world blurred and blended until it became some sort of abstract sci-fi space. Spinning the camera completely changed the look and style.

I ended up with a UI that closely resembled the Cyberpunk 2077 one complete with VO modal popup. I guess it must have featured a lot in the training data.

Really not sure what to make of this, seems to have no constraints on concept despite the prompt (I specifically used the word fantasy), no spatial memory, no collision, or understanding of landscape features in order to maintain a sense of place.

avaer 1/23/2026||

Accurate to my experience hacking on this model today, but I don't think anyone's blowing smoke about it.

Thinking back to where GPT-3 was 5 years ago, I can't help but be a little bit excited. And unlike GPT-3 this is Apache.

Grimblewald 1/24/2026||

I'd put this closer to GPT-2 tbh. GPT-3 was already quite impressive and functional. We haven't come particularly far since imo. More small noticeable steps, but no significant jumps.

cheevly 1/24/2026||

Bro, go try GPT-3 right now. You are off your rocker. Good luck even getting it to stop repeating the same word indefinitely.

Grimblewald 1/24/2026||

Skill issue. The below is an excerpt from some of my early notes from a project spanning 2020-2022 using the babbage model, not even davinci. This is from before "chat" or instruction-tuned models. I have many more notes, including more code generation, mermaid graphs, spatial reasoning, in-context learning examples, examples of using it in agentic ways to write a medium-size code project that worked, etc. I have these historic records and can tell you improvements have been minimal. Do you have any historical records or just vibes, feelings, and a lack of skill? I can confidently tell you that what has improved is the size of model required to get decent output (smaller) and, more importantly, the operator skill required to get decent output (lower). Beyond that, it is the same old.

In the below, lines which are questions were me, paragraph following was gpt3-babbage. No stop tokens used, just manually breaking off generation when it began talking as me.

---

how do mobile phones work?

Mobile phones work by connecting to cellular networks, which are made up of a series of cell towers that are spread out across a wide area. When you make a call or use data, your phone will connect to the closest cell tower, which then connects you to the wider network. This connection allows you to access features such as calling and texting, as well as data services such as web browsing and streaming.

how do cell towers work?

Cell towers are the infrastructure used in cellular networks to transmit wireless signals. They are typically tall structures that house antennas and other equipment that enable wireless communications. Cell towers work by receiving and transmitting wireless signals over a certain frequency, which is then picked up by mobile devices like smartphones. The tower then passes the signal to a base station, which then routes it to its destination.

how do wireless signals work?

Wireless signals are transmitted using radio waves. These radio waves are generated by a transmitter and then received by a receiver. The transmitter converts an electrical signal into an electromagnetic wave, which is then sent out into the air and picked up by the receiver. The receiver then interprets the signal and converts it back into an electrical current or data that can be used by the device.

what are radio waves?

Radio waves are a form of electromagnetic radiation with wavelengths ranging from as long as one meter to as short as one millimeter. Radio waves are used for a variety of communication purposes, including radio and television broadcasting, cellular phones, radar, and satellite communications.

what is electromagnetic radiation?

Electromagnetic radiation (EM radiation or EMR) is a form of energy release in the form of waves or particles that travel through space. It includes visible light, infrared, ultraviolet, X-rays, microwaves and radio waves. All of these types of radiation are produced by the sun and other natural sources on Earth.

what is energy in the context of electromagnetic radiation?

Energy in the context of electromagnetic radiation is the amount of work done by a photon, which is equal to its frequency multiplied by Planck's constant. It is also referred to as radiant energy and is closely related to intensity and power.

what is a photon?

A photon is a particle of light, which is an elementary particle that is the basic unit of all forms of electromagnetic radiation, including light. Photons have no mass and no electric charge and travel at the speed of light.

give me the formula for photonic energy in latex format?

$$\frac{hc}{\lambda}$$

explain this formula $$\frac{hc}{\lambda}$$

This formula is used to calculate the energy of a photon. It is calculated by dividing the product of Planck's constant (h) and the speed of light (c) by the wavelength (λ) of the photon. The result is the energy (E) of the photon in joules.

write a python script that solves for lambda in this formula

```python
# Import the math library to use sqrt
import math

# Define constants
h = 6.626e-34  # Planck's constant in m2 kg / s
c = 3.0e+8  # Speed of light in m/s
E = 4.5e-19  # Energy of photon in joules

# Calculate lambda using the formula lambda = h*c/E
lambda_val = (h*c)/E

# Print the result
print('The wavelength of the photon is {} meters.'.format(math.sqrt(lambda_val)))
```

write a python script that solves this formula

```python
def calculate_energy(h, c, λ):
    energy = (h * c) / λ
    return energy

h = 6.626e-34  # Planck's constant in joules/second
c = 3e8  # Speed of light in m/s
λ = 500e-9  # Wavelength of the photon in m

energy = calculate_energy(h, c, λ)
print("The energy of the photon is", energy, "joules.")
```
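For reference, a minimal sketch of the calculation the transcript is attempting. Rearranging E = hc/λ gives λ = hc/E directly, with no square root involved. The function names and the rounded constants below are illustrative assumptions, not part of the original notes.

```python
# Rounded physical constants in SI units (illustrative precision only).
PLANCK_H = 6.626e-34  # Planck's constant, J*s
LIGHT_C = 3.0e8       # speed of light, m/s


def photon_energy(wavelength_m):
    """Energy in joules of a photon with the given wavelength in meters."""
    return PLANCK_H * LIGHT_C / wavelength_m


def photon_wavelength(energy_j):
    """Wavelength in meters of a photon with the given energy in joules."""
    return PLANCK_H * LIGHT_C / energy_j


if __name__ == "__main__":
    # Same sample energy as the transcript's first script.
    print("wavelength (m):", photon_wavelength(4.5e-19))
    # Same sample wavelength as the transcript's second script.
    print("energy (J):", photon_energy(500e-9))
```

The two functions are inverses of each other, so converting a wavelength to an energy and back should return the starting value up to floating-point error.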

cheevly 5 days ago||

Do you not see the repetition here? You're literally proving my point. Yes, I know you can apply token penalties.
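For readers unfamiliar with the "token penalties" mentioned here: they usually mean frequency and presence penalties subtracted from a token's logit before sampling. Below is a minimal sketch, assuming the additive form commonly described for the OpenAI API (count times frequency penalty, plus a flat presence penalty for any token already seen). The function name, toy logits, and penalty values are made up for illustration.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens,
                    frequency_penalty=0.5, presence_penalty=0.5):
    """Return a copy of `logits` with already-generated tokens pushed down.

    logits: dict mapping token -> raw score for the next position.
    generated_tokens: tokens emitted so far in this completion.
    """
    counts = Counter(generated_tokens)
    adjusted = dict(logits)  # leave the caller's dict untouched
    for token, count in counts.items():
        if token in adjusted:
            # Frequency term grows with each repeat; presence term is flat.
            adjusted[token] -= count * frequency_penalty + presence_penalty
    return adjusted


if __name__ == "__main__":
    toy_logits = {"the": 2.0, "cat": 1.0}
    print(apply_penalties(toy_logits, ["the", "the"]))
```

In this toy case "the" has appeared twice, so its score drops below that of the unseen token "cat", which is the mechanism that discourages loops of repeated words.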

lcastricato 1/23/2026||

BTW, there is a gradio space here:

https://huggingface.co/spaces/Overworld/waypoint-1-small

And our streamed version:

https://overworld.stream

avaer 1/23/2026||

If you think this is cool you might also be interested in https://github.com/MineDojo/NitroGen which is kind of the opposite (and complementary).

Plankaluel 1/23/2026||

An RTX 5090 for 20-30fps for the small model: That is not as unreasonable as I had feared :D

dsrtslnd23 1/23/2026||

10,000 hours training data seems quite low for a world model?

lcastricato 1/23/2026|

60fps training data goes a long way ;)
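Back-of-the-envelope arithmetic for the exchange above, assuming all 10,000 hours were captured at a constant 60 fps (the thread does not state this explicitly). The 24 fps comparison is only an illustrative baseline.

```python
def total_frames(hours, fps):
    """Number of frames in `hours` of footage recorded at `fps` frames per second."""
    return int(hours * 3600 * fps)


if __name__ == "__main__":
    print(f"{total_frames(10_000, 60):,} frames at 60 fps")  # 2,160,000,000
    print(f"{total_frames(10_000, 24):,} frames at 24 fps")  # 864,000,000
```

Under that assumption the dataset holds roughly 2.16 billion frames, 2.5 times as many as the same number of hours at 24 fps.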

echelon 1/23/2026||

You guys have my support. I'll pay you when you open up payments.

We need open source world models.

khimaros 1/23/2026||

this is like an open weights version of DeepMind's Genie

lcastricato 1/23/2026||

Hi,

Louis here, CEO of Overworld. Happy to answer questions :)

lostmsu 1 day ago||

What kind of compression is used? (The blog post says the model is trained on compressed frames.)

Is there an architecture diagram of the model and/or a minGPT-style implementation?

rcv 1/24/2026|||

Looks like your login is busted. I get the following when trying to log in with Google or Github:

```
{
  "code": "REDIRECT_URL_NOT_WHITELISTED",
  "error": "Redirect URL not whitelisted. Did you forget to add this domain to the trusted domains list on the Stack Auth dashboard?"
}
```

anotheryou 1/23/2026|||

Wouldn't a little Google Maps-style navigation mostly solve latency?

Project on to a sphere, crop a little bit, do onset of motions by rotating or moving in the sphere
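One way to read this suggestion, sketched under loud assumptions: the model renders a frame somewhat wider than the viewport, and the client shifts the crop window immediately on camera input while waiting for the next generated frame. The function name, the parameters, and the flat horizontal shift (a small-angle stand-in for a true spherical reprojection) are all hypothetical and not taken from Waypoint-1 or any Overworld code.

```python
def crop_window(frame_width, view_width, yaw_deg, deg_per_pixel):
    """Return (left, right) pixel columns of the viewport inside a wider frame.

    yaw_deg is the camera rotation since the frame was generated
    (positive turns right). The shift is clamped to the spare margin,
    so large turns still have to wait for a freshly generated frame.
    """
    max_shift = (frame_width - view_width) // 2   # spare pixels on each side
    shift = round(yaw_deg / deg_per_pixel)        # requested horizontal shift
    shift = max(-max_shift, min(max_shift, shift))
    left = max_shift + shift
    return left, left + view_width


if __name__ == "__main__":
    # 1280-pixel frame, 1024-pixel viewport, 0.125 degrees per pixel.
    print(crop_window(1280, 1024, 0, 0.125))   # centered crop
    print(crop_window(1280, 1024, 5, 0.125))   # small turn to the right
    print(crop_window(1280, 1024, 90, 0.125))  # large turn, clamped to the margin
```

This only hides latency for small rotations, since the margin is the only image content available before the model responds.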

dsrtslnd23 1/23/2026||

Great work! Will the medium model also be open/Apache-licensed?

lcastricato 1/23/2026||

Medium is going to be CC BY-NC-SA 4.0. We may reevaluate in the future and make it more lenient. Small is meant to be the model for builders and hackers.