Posted by palashawas 22 hours ago

Computer Use is 45x more expensive than structured APIs (reflex.dev)
424 points | 244 comments | page 5
overgard 20 hours ago|
I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.)

The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:

- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.

- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.

- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.

Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.

peyton 20 hours ago|
It’s great at

1. things you wouldn’t otherwise bother doing

2. things where it otherwise would get stuck iterating on hacky workarounds doomed to fail

“Reverse engineer this app/site so we can do $common_task in one click”, “by the way, I’m logged in to $developer_portal, so try @Browser Use if you’re stuck”, etc.

I just had Codex pull user flows out of a site I’m working on and organize them on a single page. It found 116. I went in and annotated where I wanted changes, and now it’s crunching away fixing them all. Then it’ll give me an updated contact sheet and I can do a second pass.

I’d never do this sort of quality pass manually and instead would’ve just fixed issues as they came up, but this just runs in the background and requires 15 minutes of my time for a lot of polish.

overgard 19 hours ago||
I guess the problem I see here is that if the use case is "things I otherwise wouldn't bother doing", that's fine, but it's pretty niche. I dunno, if you're talking about a human "Agent" (like say in sports or entertainment), they'd be a trusted person to handle business matters outside of your competency (contract negotiations, etc.). I don't see AI "agents" being at all like that, they're more like an intern you need to supervise constantly.
dist-epoch 21 hours ago||
It doesn't matter.

Electron uses 10x more RAM than regular apps. But it's so convenient.

Python is 100x slower than C. It's in the top 3 of languages now.

Worse but more convenient always wins.

password4321 13 hours ago|
This is probably why MCP "code mode" (generating code once to call the MCP going forward) hasn't caught on yet... no need until the financial costs reflect reality.
moralestapia 21 hours ago||
This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.
palashawas 21 hours ago|
Right - we did this benchmark because we launched a plugin that makes APIs programmatically from an app's human-oriented UI (from the event handlers, to be specific). So any app that has a human-oriented UI now has an API.

The benchmark is a more generally interesting part of the launch materials, so I figured it deserved its own separate home here.

moralestapia 20 hours ago||
That is actually great, I'll definitely check it out. Thanks!
m3kw9 10 hours ago||
I ran a simple computer-use task to search for something, and it used up 50% of my 5h Codex plan limit.
zephen 20 hours ago||
I find this extremely surprising.

When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.

morpheos137 10 hours ago||
Who would have thunk? You know what is a great LLM agent API? Bash. Vast corpus, text based, already trained into the model.
hamasho 12 hours ago||
I'm trying to use computer use and browser use (via Playwright MCP) in my work. Computer use is hit and miss (mostly miss), but Playwright MCP often works very well. The downside is that it takes a lot of time to complete even easy tasks.

For example, to automate processing emails, it needs to:

1. go to Gmail

2. log in to Google if necessary (this often requires two-step verification, so it's hard to automate completely, but possible)

3. read the latest mail

4. check the content and choose the action:

- if needed, reply to the email

- if it mentions tasks, add them to the todo list

- if it mentions schedules, add them to the calendar

5. repeat for all emails based on specified conditions

Each step requires dozens of DOM (a11y tree) analyses and actions (fill in the username/password inputs, tick the keep-me-logged-in box, click the submit button, etc.). Depending on the model used, each step can take ~100s, so easy tasks can easily add up to tens of minutes or even hours.

For frequently used tasks, I write skills like /logging-in and /read-latest-emails using Playwright scripts and let the agent choose them. Based on the email content, the agent then chooses other tools like /write-reply, /add-todo, /add-event, etc., so that the model can focus only on the core tasks that require thinking. It reduces the execution time drastically.
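A minimal sketch of that dispatch idea, assuming a plain Python registry. The `skill` decorator, `choose_skill`, and the handler bodies are hypothetical stand-ins: real handlers would drive Playwright scripts, and the choice would be made by the model rather than by keyword rules.

```python
from typing import Callable, Dict

# Hypothetical registry mapping slash-command names to deterministic scripts.
SKILLS: Dict[str, Callable[[dict], str]] = {}


def skill(name: str):
    """Register a function as a skill under the given slash-command name."""
    def decorator(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        SKILLS[name] = fn
        return fn
    return decorator


# The handlers below are stubs; a real one would run a browser script.
@skill("/write-reply")
def write_reply(email: dict) -> str:
    return f"reply drafted: {email['subject']}"


@skill("/add-todo")
def add_todo(email: dict) -> str:
    return f"todo added: {email['subject']}"


@skill("/add-event")
def add_event(email: dict) -> str:
    return f"event added: {email['subject']}"


def choose_skill(email: dict) -> str:
    """Stand-in for the model's decision (assumption: keyword rules, not an LLM)."""
    body = email["body"].lower()
    if "meeting" in body or "schedule" in body:
        return "/add-event"
    if "task" in body or "todo" in body:
        return "/add-todo"
    return "/write-reply"


def process(email: dict) -> str:
    """The only 'thinking' step is choose_skill; everything else is scripted."""
    return SKILLS[choose_skill(email)](email)
```

The point of the split is that the slow, expensive part (deciding what an email needs) stays with the model, while the clicking and typing runs as ordinary code.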

But it can bury important business logic in the Playwright scripts instead of the agent's instructions. For example, the simplified steps to add TODO items look like this:

1. read the email

2. check if it's about todos, then decide to add them to Asana

3. extract and summarize the title, content, priority, due date, tags, etc.

4. access Asana (log in if necessary)

5. check if there are similar tasks

6. if not, add the tasks

This can take tens of minutes, and each step can carry important business logic, like:

- how to decide the priority and due date

- how to choose tags based on the content

- how to decide if two tasks are similar

This information should be readable and updatable not only by developers, but also by managers and other teams. If I write those steps as skills with Playwright scripts, it improves the speed, but all that business logic is buried in the code, so it's not accessible to non-technical people. It's also error-prone, because websites often tweak their UI and the scripts can stop working.

So it would be very convenient if the agent processed these steps once, then decided whether it's worth writing the Playwright script, so that next time those mundane processes can be executed instantly.

With automatic skill generation, the agent decides by itself whether there are workflows worth turning into skills with Playwright scripts, like /log-in, /extract-information, /check-similar-tasks, /add-tasks. Like the output of a just-in-time compiler, the skills are a byproduct of the agent's instructions: all business logic is written in the agent's instructions, and the skills don't need to be updated manually or tracked in a version control system.
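A rough sketch of that JIT-like behaviour, under stated assumptions: `JitSkillRunner`, `slow_agent`, `compile_script`, and the `threshold` parameter are all hypothetical names, and both callables are stubs. In practice `slow_agent` would be the model stepping through the browser and `compile_script` would be the agent writing a Playwright script. The fallback on failure is my own addition to cover the case where a site changes its UI and a script stops working.

```python
from typing import Callable, Dict, Tuple


class JitSkillRunner:
    """Run a workflow via the slow agent path until it is 'hot', then via a cached script."""

    def __init__(
        self,
        slow_agent: Callable[[str, str], str],
        compile_script: Callable[[str], Callable[[str], str]],
        threshold: int = 2,
    ) -> None:
        self.slow_agent = slow_agent
        self.compile_script = compile_script
        self.threshold = threshold
        self.calls: Dict[str, int] = {}
        self.scripts: Dict[str, Callable[[str], str]] = {}

    def run(self, workflow: str, payload: str) -> Tuple[str, str]:
        """Return (path_used, result), where path_used is 'script' or 'agent'."""
        if workflow in self.scripts:
            try:
                return "script", self.scripts[workflow](payload)
            except Exception:
                # Assumption: a failing script is stale, so discard it and
                # fall back to the agent (similar to JIT deoptimization).
                del self.scripts[workflow]
                self.calls[workflow] = 0

        result = self.slow_agent(workflow, payload)
        self.calls[workflow] = self.calls.get(workflow, 0) + 1
        if self.calls[workflow] >= self.threshold:
            # The workflow is now 'hot': generate a script for future runs.
            self.scripts[workflow] = self.compile_script(workflow)
        return "agent", result


# Stub implementations, for illustration only.
def slow_agent(workflow: str, payload: str) -> str:
    return f"{workflow}:{payload}"


def compile_script(workflow: str) -> Callable[[str], str]:
    return lambda payload: f"{workflow}:{payload}"
```

With `threshold=2`, the first two calls to a workflow go through the agent and the third runs the generated script. The scripts live only in the runner's cache, which matches the idea that they are a disposable byproduct rather than something to maintain by hand.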

This can cut a lot of execution time and API cost, and it can be applied beyond browser automation, to computer use or any other agentic task, wherever it's possible to write automation scripts for the parts that don't require thinking.

j45 13 hours ago||
Sounds like some efficiency gains will still arrive.
RobRivera 18 hours ago||
UX feedback

Me: hmm, this title confuses and infuriates Rob.

[Clicks link]

Me: Sees same title, repeat feelings of confusion and infuriation

[Scrolls article down on my smartphone]

Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.

[Closes tab]

[Continues living rest of my life]

I hope this feedback is well received and understood.

mrcwinn 16 hours ago|
We need a superset of HTML that is designed for agents. I'm not sure it's quite as simple as "just make everything an API."