Posted by palashawas 20 hours ago
> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
This is a model problem, though: because the model failed to understand it could scroll, you were forced to have it consume a multiple of the tokens. Could you come up with an alternative here?
Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.
The models frequently failed for a variety of reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares.
Pretty hard to guess what Anthropic trained Sonnet on, but general multimodal models are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now.
I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency.
Not possible on Wayland; maybe with the X11 protocol?
Or even one based on PDF, like OS X's Quartz 2D: https://en.wikipedia.org/wiki/Quartz_2D
There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.
Try playing Fruit Ninja via text and LLM tool calls, though.
I don't think any new app should ever be specifically designed for AI to interact with it through computer use.