The only reason you wouldn’t choose an API is if it wasn’t viable.
In particular, the vision-based approach used in the evaluation has clear efficiency limitations by its very nature (small observation window, heterogeneous modality).
At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and to supply all and only the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes (toy sketch below). This makes navigation easier, and that's why small models can do it, which is another dimension that must be considered.
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be.
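To make that token-efficiency point concrete, here's a toy sketch (hypothetical markup, not Smooth's actual pipeline) of how little context a dropdown needs to expose its entire action space:

```python
# Toy illustration: a dropdown already encodes the full set of valid
# choices in a handful of tokens, so an agent reading the DOM gets
# "all and only the necessary context" without any API schema.
from bs4 import BeautifulSoup

html = """
<label for="shipping">Shipping</label>
<select id="shipping">
  <option value="std">Standard (3-5 days)</option>
  <option value="exp">Express (1-2 days)</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
select = soup.find("select", id="shipping")
# A compact, unambiguous action space for the model to choose from.
choices = {o["value"]: o.get_text(strip=True) for o in select.find_all("option")}
print(choices)  # {'std': 'Standard (3-5 days)', 'exp': 'Express (1-2 days)'}
```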
The problem is that not everything from the 'past' can be accessed via APIs. It used to be a fun time - remember Prism? [1] - I would just run it, capture all the API calls in a nice format, and then replay them over and over to do things in succession.
In the new world we have access to OpenAPI.json and whatnot, but for things built in the days pre-OpenAPI, pre-specs, and pre-best-practices... I am not so sure! (and a lot of the world still lives there)
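A rough sketch of that capture-and-replay idea (the recorded calls here are made up; a real capture would come from a proxy like Prism or the browser's network tab):

```python
# Replay previously captured API calls in succession.
# The recorded calls below are hypothetical placeholders; in practice
# they would come from a capture tool rather than being hand-written.
import requests

recorded_calls = [
    {"method": "POST", "url": "https://example.com/api/login",
     "json": {"user": "me", "password": "secret"}},
    {"method": "GET", "url": "https://example.com/api/orders"},
]

session = requests.Session()  # carries cookies across calls, like a browser
for call in recorded_calls:
    resp = session.request(call["method"], call["url"], json=call.get("json"))
    resp.raise_for_status()
    print(call["url"], resp.status_code)
```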
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
I think OpenAI designing their own phone is the next logical step. I hope they succeed; it would bring major competition to Apple and Android.
Yes, in an ideal world that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden everyone will just decide to make the data that is their bread and butter easier to explore via an LLM.
Why on earth would they offer convenient hooks for AI chatbots?
Competition. If I ask my OS-level AI assistant to find a social media reel about an elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more. Watch how fast Meta adds this if a new hot-shot social media app succeeds by designing for AI agents controlled by users.
This is the exact opposite of what will happen (and, in fact, of what has already happened). Reddit is suing Perplexity right now for scraping.
Meta will not serve content to some other app for free - for what benefit? They will not see advertising data.
Advertising isn't the only possible business model.
And profit isn't the only possible motive to provide a service.
A closer analogue would be AppleScript - or rather, the underlying Apple Event and Open Scripting Architecture functionality supplied by the OS to support AppleScript. It allowed applications to expose these interfaces along with metadata documenting them, and it allowed external tools to record manually performed tasks across applications as programs expressed in terms of those interfaces, making them easier to use (this last bit, while not strictly required, is convenient, and especially useful for less technical users).
If you're familiar with VBA in Microsoft Office applications, it was sort of like that, with a few differences: the support came from OS APIs that any application choosing to implement scripting could use; official guidance from Apple said that all well-designed applications should be scriptable and recordable; and application design patterns and frameworks were built with scriptability and recordability in mind.
Note that I use the past tense here, despite AppleScript still being available in macOS, because it is not well-supported by modern applications.
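For anyone who never saw it in action, here's a minimal sketch of sending an Apple Event to a still-scriptable app (Finder) from Python via osascript - illustrative only, not how the AppleScript tooling itself worked:

```python
# Send an Apple Event to a scriptable application via osascript.
# Finder is one of the few modern apps that still exposes a rich
# scripting dictionary; most third-party apps no longer do.
import subprocess

script = 'tell application "Finder" to get the name of every window'
result = subprocess.run(
    ["osascript", "-e", script],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```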
There are no shortcuts in life, and it's just expensive text autocomplete.
"Lets spin up $750k in GPUs full throttle to scrape a web page with my $200.00 CC subscription."
Everyone is delusional.
I could imagine an AI future where agentic shopping companies that promise me the best deal are pitted against Walmart and Amazon, who are trying to algorithmically squeeze me for $2 more: just two bots playing a cat-and-mouse game to save me a few bucks.
For some reason a lot of tech ends up in these antagonistic monopolies: Apple wants to sell privacy-aware devices as a product feature; Google wants to give you mail and maps, but sell your data. Despite any appearances, neither gives a shit about you, even if you benefit from the dynamic.
That's only another step on the path I've experienced since the 80s, when I had to type every single character because there was no autocomplete, no command-line history, and very few libraries. I was very good at writing trees, hash tables, and linked lists, and so was everybody else. Nobody would hire me if I were that slow at writing code today.
This is not going to happen, or if it does it will just be Android (reskinned/modified, the way Samsung does it), and it will certainly use Google Play Services.
The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. It's random, and any notion of "this was a good business model and that was a bad one" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.
Isn't that what Apple is doing with its Foundation Models framework? [0] Developers can integrate Apple's on-device LLM, which includes things like tool calling. I don't write Apple-specific apps, so I'm not sure what can actually be done with it, but it looks promising, and it's where Apple already seems to think things are headed.
> I think OpenAI designing their own phone is the next logical step
ChatGPT is already integrated into Apple Intelligence for those that want to use it instead of Apple's models -- I don't see OpenAI trying to change lanes into phone-making when they can focus on doing what they know while collecting a large check from Apple.
[0] https://developer.apple.com/documentation/foundationmodels
I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.
Open source research/project I have been exploring on the topic: https://aevum.build/learn/architecture/
So, like a Unix system?
I'm really not sure companies will allow their apps to be automated so easily, and the reason is API abuse (think of a SaaS where you can upload file attachments, for example). You'd either end up banned or throttled pretty fast, and in the end the company will decide "cost > opportunity" and just close it off (it's like this already; LLMs just make it worse).
Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?
Humans would be the second-class users of said OS, which can generate UIs on demand as needed.
I've thought about this quite a bit. Started implementing as a side project, but I have too many side projects at the moment...
It’s why we did this benchmark :) - reflex team member
Perfect.
The modern JavaScript ecosystem is a perfect example of what happens when everyone tries to rebuild from scratch, and it's a nightmare.
I want to be able to control both Mac apps and the browser. I also need it to figure things out by itself, given a goal.
Claude Code with the --chrome flag is kind of good, but it's too slow. I wanted to try faster APIs, like the one hosted on Cerebras, but it's too expensive.
Any solution I might be missing?
Or something you don't understand how to do manually?
Because I guess I don't understand the attraction of using an LLM for system automation where interfaces already exist, other than as a form of documentation, or to write code against those interfaces.
To me the browser is a translation layer. Working on the browser directly, while hard, brings big compatibility advantages. The only thing I'm missing as of now, which is on the to-do list, is OCR of the images in the browser into text (rough sketch below); but an API would need to do that anyway to work.
The main loss, in my view, of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI; that's it. Computer use, to me, is the promise of being able to replicate end-to-end the actions a human takes. An API can do that in theory, but the data to do it is also near impossible to collect properly.
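For the OCR step mentioned above, something roughly like this (a sketch using Tesseract via pytesseract; the file name is made up, and this is not any particular product's implementation):

```python
# Rough sketch of the OCR-images-to-text step using Tesseract.
# Illustrative only; screenshot.png is a hypothetical page capture.
import pytesseract
from PIL import Image

def image_to_text(path: str) -> str:
    """Extract whatever text Tesseract can find in a page screenshot."""
    return pytesseract.image_to_string(Image.open(path))

print(image_to_text("screenshot.png"))
```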
Idk... not really thought out too much, but it has to be better.
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17-minute (!!!) total time for the vision agent vs. 0.5s-2.8s for the API approach.
A big part of the challenge with vision is that to manipulate the DOM you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
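Concretely, with Playwright that pattern looks something like this (a minimal sketch; the URL and selector are made up):

```python
# Minimal sketch of the "artificial waits" pattern with Playwright.
# The URL and selector below are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/cart")
    # Can't click until we're sure this part of the DOM actually exists.
    page.wait_for_selector("#checkout-button", state="visible", timeout=10_000)
    page.click("#checkout-button")
    browser.close()
```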
Thinking of Frigate NVR, which does motion > object detection > scene description. You build up through progressively slower and more expensive algorithms, i.e. there's motion > it's a person > here's what the person is doing.
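In code, that cascade looks something like this (a hypothetical sketch with stub detectors; Frigate's real pipeline is more involved):

```python
# Hypothetical cascade: each stage is slower and costlier than the
# last, and only runs when the previous stage fires. The detector
# functions are stubs standing in for real models.
def detect_motion(frame) -> bool:
    # cheap: e.g. frame differencing; stub always fires here
    return True

def detect_person(frame) -> bool:
    # mid-cost: e.g. a small object detector; stub
    return True

def describe_scene(frame) -> str:
    # expensive: e.g. a vision-language model; stub
    return "a person is unloading groceries"

def process(frame):
    if not detect_motion(frame):
        return None                  # no motion: stop, zero extra cost
    if not detect_person(frame):
        return "motion only"         # motion but no person: skip the VLM
    return describe_scene(frame)     # only now pay for the expensive model

print(process(frame=None))
```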
Had we not made that wrong turn, LLMs and humans would have a much easier time reasoning about APIs they don't directly control.
tRPC, gRPC, etc. are all attempts to add schemas back into JSON. Swagger, OpenAPI, etc. are attempts to add discoverability back into JSON-based RPC APIs.
MCP falls in here as well; it attempts to add schemas and discoverability back in where our APIs aren't actually RESTful.
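As a sketch of what that schema layer buys you over bare JSON, using the jsonschema library (the schema and payload are hypothetical):

```python
# What "adding schemas back into JSON" buys you: the payload can be
# validated, and its shape discovered, without reading the server code.
# The schema and payload below are made up for illustration.
from jsonschema import validate, ValidationError

order_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "status": {"type": "string", "enum": ["open", "shipped"]},
    },
    "required": ["id", "status"],
}

payload = {"id": 42, "status": "shipped"}
try:
    validate(instance=payload, schema=order_schema)
    print("payload matches schema")
except ValidationError as e:
    print("rejected:", e.message)
```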
REST has nothing to do with structured data or discoverability.