Posted by ilreb 4 hours ago
Seems like this might make that a lot less painful. And if not right off the bat, then with some minimal tuning or even just good prompting.
I assumed at first that it was trained on synthetic data, but they actually went and deployed real physical hosts and virtual machines (e.g. Ubuntu, macOS, and Android) and browsers. They ran agentic systems on these continuously and recorded the actual, real-world interactions.
So it's an LLM that infers the next state, or outcome, as structured data, e.g. literal HTML code, UI view hierarchies, or accessibility trees.
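To make that concrete, here is a rough sketch of what "infer the next state as text" could look like from the caller's side. Everything in it is my own assumption for illustration: the `Step` type, the prompt layout, and the `generate` callable are hypothetical, and `fake_generate` is a stub standing in for the actual model, not their API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    state: str   # serialized observation, e.g. HTML or an accessibility tree
    action: str  # agent action as text, e.g. 'click(id="submit")'


def build_prompt(step: Step) -> str:
    # Hypothetical prompt layout: current state, then action, then a slot
    # for the model to fill in with the predicted next state.
    return (
        "Current state:\n" + step.state + "\n\n"
        "Action:\n" + step.action + "\n\n"
        "Next state:\n"
    )


def simulate(step: Step, generate: Callable[[str], str]) -> str:
    # The world model is just a text-in, text-out function here.
    return generate(build_prompt(step))


def fake_generate(prompt: str) -> str:
    # Stub standing in for a real model call; always returns the same page.
    return "<p>Form submitted</p>"


step = Step('<button id="submit">Send</button>', 'click(id="submit")')
next_state = simulate(step, fake_generate)
print(next_state)
```

The point is only that the "environment" becomes a text-in, text-out call, so an agent loop could swap a real browser for the model without changing its own interface.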
> Figure 1: Overview of Qwen-AgentWorld. Top: Qwen-AgentWorld is a unified native language world model across seven domains. Bottom: We explore two complementary strategies for applying world modeling to enhance language agents (mainly using the 35B-A3B model as agent): Decouple and Unify, where the world model serves as the environment simulator and agent foundation model, respectively.
Where is the mistake?
The bars above the label "Infinite Real-World Envs" show growth from, for example, approx. 42 to 55, but the red label says "+7.1". It's wrong for all of them.
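Quick sanity check on that one pair of bars. The 42 and 55 are my eyeballed readings off the chart, not numbers from the paper, so treat this as a rough check only:

```python
# Approximate bar heights read off the figure (assumed, not official values).
before, after = 42.0, 55.0
label = 7.1  # the red "+7.1" annotation

absolute_gain = after - before                     # difference in points
relative_gain = (after - before) / before * 100.0  # percent growth

print(f"absolute gain: {absolute_gain:.1f}")
print(f"relative gain: {relative_gain:.1f}%")
print(f"label says:    +{label}")
```

Neither the absolute difference nor the percentage growth comes out anywhere near 7.1, at least with these readings.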