The only reason you wouldn’t choose an API is if it wasn’t viable.
In particular, the vision-based approach used in the evaluation has clear efficiency limitations by its very nature (small observation window, heterogeneous modality).
At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and to supply all and only the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes (toy sketch below). This makes navigation easier, and that's why small models can do it, which is another dimension that must be considered.
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be.
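To make that token-efficiency point concrete, here's a toy sketch (hypothetical markup, not Smooth's actual pipeline) of how little context a dropdown needs to expose its entire action space:

```python
# Toy illustration: a dropdown already encodes the full set of valid
# choices in a handful of tokens, so an agent reading the DOM gets
# "all and only the necessary context" without any API schema.
from bs4 import BeautifulSoup

html = """
<label for="shipping">Shipping</label>
<select id="shipping">
  <option value="std">Standard (3-5 days)</option>
  <option value="exp">Express (1-2 days)</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
select = soup.find("select", id="shipping")
# A compact, unambiguous action space for the model to choose from.
choices = {o["value"]: o.get_text(strip=True) for o in select.find_all("option")}
print(choices)  # {'std': 'Standard (3-5 days)', 'exp': 'Express (1-2 days)'}
```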
The problem is that not everything from the 'past' can be accessed via APIs. It used to be a fun time - remember Prism? [1] - I would just run it, capture all the API calls in a nice format, and then replay them over and over to do things in succession.
In the new world we have access to OpenAPI.json and whatnot, but for things built in the days pre-OpenAPI, pre-specs, and pre-best-practices... I am not so sure! (and a lot of the world still lives there)
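A rough sketch of that capture-and-replay idea (the recorded calls here are made up; a real capture would come from a proxy like Prism or the browser's network tab):

```python
# Replay previously captured API calls in succession.
# The recorded calls below are hypothetical placeholders; in practice
# they would come from a capture tool rather than being hand-written.
import requests

recorded_calls = [
    {"method": "POST", "url": "https://example.com/api/login",
     "json": {"user": "me", "password": "secret"}},
    {"method": "GET", "url": "https://example.com/api/orders"},
]

session = requests.Session()  # carries cookies across calls, like a browser
for call in recorded_calls:
    resp = session.request(call["method"], call["url"], json=call.get("json"))
    resp.raise_for_status()
    print(call["url"], resp.status_code)
```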
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
I think OpenAI designing their own phone is the next logical step. I hope they succeed; it would bring major competition to Apple and Android.
Yes, in an ideal world that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden everyone will just decide to make the data that is their bread and butter easier to explore via an LLM.
Why on earth would they offer convenient hooks for AI chatbots?
Competition. If I ask my OS-level AI assistant to find a social media reel about an elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more. Watch how fast Meta adds this if a new hot-shot social media app succeeds by designing for AI agents controlled by users.
This is the exact opposite of what will happen (and, in fact, of what has already happened). Reddit is suing Perplexity right now for scraping.
Meta will not serve content to some other app for free - for what benefit? They will not see advertising data.
Advertising isn't the only possible business model.
And profit isn't the only possible motive to provide a service.
A closer analogue would be AppleScript - or rather, the underlying Apple Event and Open Scripting Architecture functionality supplied by the OS to support AppleScript. It allowed applications to expose these interfaces along with metadata documenting them, and it allowed external tools to record manually performed tasks across applications as programs expressed in terms of those interfaces, making them easier to use (this last bit, while not strictly required, is convenient, and especially useful for less technical users).
If you're familiar with VBA in Microsoft Office applications, it was sort of like that, with a few differences: the support came from OS APIs that any application choosing to implement scripting could use; official guidance from Apple said that all well-designed applications should be scriptable and recordable; and application design patterns and frameworks were built with scriptability and recordability in mind.
Note that I use the past tense here, despite AppleScript still being available in macOS, because it is not well-supported by modern applications.
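For anyone who never saw it in action, here's a minimal sketch of sending an Apple Event to a still-scriptable app (Finder) from Python via osascript - illustrative only, not how the AppleScript tooling itself worked:

```python
# Send an Apple Event to a scriptable application via osascript.
# Finder is one of the few modern apps that still exposes a rich
# scripting dictionary; most third-party apps no longer do.
import subprocess

script = 'tell application "Finder" to get the name of every window'
result = subprocess.run(
    ["osascript", "-e", script],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```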
There are no shortcuts in life, and it's just expensive text autocomplete.
"Lets spin up $750k in GPUs full throttle to scrape a web page with my $200.00 CC subscription."
Everyone is delusional.
I could imagine an AI future where agentic shopping companies that promise me the best deal are pitted against Walmart and Amazon, who are trying to algorithmically squeeze me for $2 more: just two bots playing a cat-and-mouse game to save me a few bucks.
For some reason a lot of tech ends up in these antagonistic monopolies: Apple wants to sell privacy-aware devices as a product feature; Google wants to give you mail and maps, but sell your data. Despite any appearances, neither gives a shit about you, even if you benefit from the dynamic.
That's only another step on the path I've experienced since the 80s, when I had to type every single character because there was no autocomplete, no command-line history, and very few libraries. I was very good at writing trees, hash tables, and linked lists, and so was everybody else. Nobody would hire me if I were that slow at writing code today.
This is not going to happen, or if it does it will just be Android (reskinned/modified, the way Samsung does it), and it will certainly use Google Play Services.
The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. It's random, and any notion of "this was a good business model and that was a bad one" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.
Isn't that what Apple is doing with its Foundation Models framework? [0] Developers can integrate Apple's on-device LLM, which includes things like tool calling. I don't write Apple-specific apps, so I'm not sure what can actually be done with it, but it looks promising, and it's where Apple already seems to think things are headed.
> I think OpenAI designing their own phone is the next logical step
ChatGPT is already integrated into Apple Intelligence for those that want to use it instead of Apple's models -- I don't see OpenAI trying to change lanes into phone-making when they can focus on doing what they know while collecting a large check from Apple.
[0] https://developer.apple.com/documentation/foundationmodels
I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.
Open source research/project I have been exploring on the topic: https://aevum.build/learn/architecture/
So, like a Unix system?
I'm really not sure companies will allow their apps to be automated so easily, and the reason is API abuse (think of a SaaS where you can upload file attachments, for example). You'd either end up banned or throttled pretty fast, and in the end the company will decide "cost > opportunity" and just close it off (it's like this already; LLMs just make it worse).
Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?
Humans would be the second-class users of said OS, which can generate UIs on demand as needed.
I've thought about this quite a bit. Started implementing as a side project, but I have too many side projects at the moment...
It’s why we did this benchmark :) - reflex team member
Perfect.
The modern JavaScript ecosystem is a perfect example of what happens when everyone tries to rebuild from scratch, and it's a nightmare.
I want to be able to control both Mac apps and the browser. I also need it to figure things out by itself, given a goal.
Claude Code with the --chrome flag is kind of good, but it's too slow. I wanted to try faster APIs, like the one hosted on Cerebras, but it's too expensive.
Any solution I might be missing?
Or something you don't understand how to do manually?
Because I guess I don't understand the attraction of using an LLM for system automation where interfaces already exist, other than as a form of documentation, or to write code against those interfaces.
To me the browser is a translation layer. Working on the browser directly, while hard, brings big compatibility advantages. The only thing I'm missing as of now, which is on the to-do list, is OCR of the images in the browser into text (rough sketch below); but an API would need to do that anyway to work.
The main loss, in my view, of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI; that's it. Computer use, to me, is the promise of being able to replicate end-to-end the actions a human takes. An API can do that in theory, but the data to do it is also near impossible to collect properly.
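For the OCR step mentioned above, something roughly like this (a sketch using Tesseract via pytesseract; the file name is made up, and this is not any particular product's implementation):

```python
# Rough sketch of the OCR-images-to-text step using Tesseract.
# Illustrative only; screenshot.png is a hypothetical page capture.
import pytesseract
from PIL import Image

def image_to_text(path: str) -> str:
    """Extract whatever text Tesseract can find in a page screenshot."""
    return pytesseract.image_to_string(Image.open(path))

print(image_to_text("screenshot.png"))
```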
Idk... not really thought out too much, but it has to be better.
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17-minute (!!!) total time for the vision agent vs. 0.5s-2.8s for the API approach.
A big part of the challenge with vision is that to manipulate the DOM you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
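Concretely, with Playwright that pattern looks something like this (a minimal sketch; the URL and selector are made up):

```python
# Minimal sketch of the "artificial waits" pattern with Playwright.
# The URL and selector below are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/cart")
    # Can't click until we're sure this part of the DOM actually exists.
    page.wait_for_selector("#checkout-button", state="visible", timeout=10_000)
    page.click("#checkout-button")
    browser.close()
```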
Thinking of Frigate NVR, which does motion > object detection > scene description. You build up through progressively slower and more expensive algorithms, i.e. there's motion > it's a person > here's what the person is doing.
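In code, that cascade looks something like this (a hypothetical sketch with stub detectors; Frigate's real pipeline is more involved):

```python
# Hypothetical cascade: each stage is slower and costlier than the
# last, and only runs when the previous stage fires. The detector
# functions are stubs standing in for real models.
def detect_motion(frame) -> bool:
    # cheap: e.g. frame differencing; stub always fires here
    return True

def detect_person(frame) -> bool:
    # mid-cost: e.g. a small object detector; stub
    return True

def describe_scene(frame) -> str:
    # expensive: e.g. a vision-language model; stub
    return "a person is unloading groceries"

def process(frame):
    if not detect_motion(frame):
        return None                  # no motion: stop, zero extra cost
    if not detect_person(frame):
        return "motion only"         # motion but no person: skip the VLM
    return describe_scene(frame)     # only now pay for the expensive model

print(process(frame=None))
```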
Had we not made that wrong turn, LLMs and humans would have a much easier time reasoning about APIs they don't directly control.
tRPC, gRPC, etc. are all attempts to add schemas back into JSON. Swagger, OpenAPI, etc. are attempts to add discoverability back into JSON-based RPC APIs.
MCP falls in here as well; it attempts to add schemas and discoverability back in where our APIs aren't actually RESTful.
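As a sketch of what that schema layer buys you over bare JSON, using the jsonschema library (the schema and payload are hypothetical):

```python
# What "adding schemas back into JSON" buys you: the payload can be
# validated, and its shape discovered, without reading the server code.
# The schema and payload below are made up for illustration.
from jsonschema import validate, ValidationError

order_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "status": {"type": "string", "enum": ["open", "shipped"]},
    },
    "required": ["id", "status"],
}

payload = {"id": 42, "status": "shipped"}
try:
    validate(instance=payload, schema=order_schema)
    print("payload matches schema")
except ValidationError as e:
    print("rejected:", e.message)
```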
REST has nothing to do with structured data or discoverability.