Posted by palashawas 20 hours ago

Computer Use is 45x more expensive than structured APIs (reflex.dev)
411 points | 235 comments
svnt 19 hours ago|
> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.

> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.

This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?

Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.

palashawas 19 hours ago|
This is a fair point.

The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares.

Pretty hard to guess what Anthropic trained Sonnet on, but general multimodal models are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now.

Havoc 19 hours ago||
Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.

I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency.

lelanthran 16 hours ago||
> Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.

Not possible on Wayland; maybe over the X11 protocol?

donaldjbiden 18 hours ago|||
Wayland only has pixels. It was designed to get rid of all the X11 cruft.
QuercusMax 18 hours ago||
imagine, if you will, that we had a windowing system that's built on Postscript... lots of folks thought it was a super awesome idea, and built NeXTSTEP around it. https://en.wikipedia.org/wiki/Display_PostScript

or even one based on PDF like OSX: https://en.wikipedia.org/wiki/Quartz_2D

sheepscreek 17 hours ago||
This tracks - has been my experience exactly. Not to mention there isn’t a particularly significant lift in accuracy or speed. As things stand, to me it is the worst of both worlds. Expensive and inaccurate.
ai_fry_ur_brain 18 hours ago||
It's funny watching the slow mean reversion back to more deterministic tooling.
sudb 19 hours ago||
I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. Vercel's agent-browser, the relatively new dev-browser[1], etc.)

There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.

1. https://github.com/SawyerHood/dev-browser

palashawas 19 hours ago|
Interesting! I'll play around with agent-browser and update this article if anything comes up
cjbarber 19 hours ago||
I think of computer use as being like last-mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.
euphetar 11 hours ago||
I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents.

Try playing Fruit Ninja via text and LLM tool calls though.

zmmmmm 12 hours ago||
And structured APIs are about 1e9x more expensive than not invoking an LLM in the first place and using deterministic code to do something ... it's not like any of this is rational based on compute.
hnav 11 hours ago|
It simply doesn't fit in the token/time budget to be useful. I don't think the purveyors of these technologies care about how expensive it is as long as it's "cheap enough"
rootcage 18 hours ago||
The best use cases I've seen for computer/browser use are legacy SaaS/software. For example, hotels use archaic Property Management Systems (PMS), and they're required by corporate to use them and pay for them. The vendors can barely keep the product alive; they definitely aren't incentivized to maintain an API. In such a case a browser-use agent seems to be the best (only) way.
noprocrasted 18 hours ago|
Wouldn't using a coding agent to build a screenscraper be better?
2001zhaozhao 18 hours ago|
I have only found Computer Use useful for GUI app local debugging. Presumably it will also be useful for getting around protections for external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind.

I don't think any new app should ever be specifically designed for AI to interact with it through computer use.
