Posted by c1b 1 day ago
Requires WebGPU.
avg500: -4.6 (last 500 episodes)
peak: 3959.3 (best window)
roll/s: 20.68 (20-step avg)
progress: 4388 (562749 episodes)
But at around 4K avg score you should see it solve the env almost every time.
Just a demo :) optimized for speed over stability.
Reward structure: Step: -1, Dot: +100, Win: +1000, so ~4k is the max theoretical score on 6x6.
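To make the arithmetic behind that ceiling concrete, here is a rough sketch in Python. Only the three reward values come from the comment above; the function name, the dot count (30) and the step count (41) are hypothetical numbers picked for illustration, not figures from the demo.

```python
# Reward values as stated in the thread.
STEP_REWARD = -1
DOT_REWARD = 100
WIN_REWARD = 1000


def episode_score(dots_eaten: int, steps: int, won: bool) -> int:
    """Total reward for one episode under the stated reward structure."""
    score = dots_eaten * DOT_REWARD + steps * STEP_REWARD
    if won:
        score += WIN_REWARD
    return score


# Hypothetical episode: 30 dots on the board, all eaten in 41 steps, then a win.
# The dot and step counts are assumptions, not taken from the demo.
print(episode_score(dots_eaten=30, steps=41, won=True))  # 3959
```

Under these assumptions the dot and win bonuses dominate, and the step penalty only shaves a few dozen points off, which is why a near-perfect run lands around 4k.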
Alternatively it might be a problem with the scoring model in the end game.
That is the point: there is nothing about an intention that we cannot improve. The goal here is no more than 1 unique iteration of the same path.
I noticed that if you go from training to watch and then back, the training score temporarily drops significantly.
trained and made a viz for the model and then made it displace text.
should probably do a proper write-up: https://x.com/i/status/2038367016969724259