Well, if your backend was sufficiently decoupled from your frontend, and the server-side operations were designed thoughtfully and generically, it need not be an engineering project.
Hang on, that sounds like common corporate SaaS apps.
It's kind of fascinating that we were never willing to do these things for humans, but now that AI needs it ... we are all in. A bit depressing in the sense that I think the main reason we're happy to do it for AI is that we perceive it will benefit us personally rather than some abstract future human.
> It's kind of fascinating that we were never willing to do these things for humans, but now that AI needs it ... we are all in. A bit depressing in the sense that I think the main reason we're happy to do it for AI is that we perceive it will benefit us personally rather than some abstract future human.
I don't think that's the reason.
I think it's because they take time, and few people were willing to put in that time for "maybe it'll make writing the actual code faster" gains when writing the code itself was going to take several times longer anyway.
You also can get faster feedback to iterate on your spec now, which improves the probability of it helping future-you.
So combine that with the fact that LLMs are more likely to get lost if you don't spec stuff in advance, and the value of up-front work is higher (whereas a human is more likely to land on the right track, just more slowly, which makes the value harder to quantify).
Actually there's a lot of projection there too; I don't read documentation in detail. And nowadays, I point an LLM at documentation so that it can find the details I would otherwise skip over.
The destruction of the millennial attention span is real, and it's worse in the younger generations, lmao.
I guess that just never occurred to anybody before.
One of the best parts of LLMs is that you can use them to bootstrap your documentation, or scan for outdated things, etc, far more quickly than ever before.
Don't just throw a mountain at it and ask it to get it right, but use a targeted process to identify inconsistencies, duplicates, etc, and then resolve those.
And then you have better onboarding material for the next human OR llm...
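To make that concrete, the "targeted process" I mean is roughly a file-by-file pass like this sketch. It assumes the OpenAI Python client; the model name and prompt wording are placeholders to adapt to whatever you actually use:

```python
# Minimal sketch of a targeted doc-review pass (not "throw the whole mountain
# at it"). Assumes the OpenAI Python client; model and prompt are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are reviewing one documentation file. List any statements that look "
    "outdated, internally inconsistent, or duplicated within the file. "
    "Answer as a short bullet list, or 'OK' if nothing stands out."
)

for doc in sorted(Path("docs").glob("**/*.md")):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": doc.read_text()},
        ],
    )
    print(f"## {doc}\n{resp.choices[0].message.content}\n")
```

Then a human (or a second pass) resolves the flagged items instead of regenerating the docs wholesale.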
No, that's backwards. Any documentation an AI can make, another AI can regenerate. If an LLM didn't write the code, it shouldn't document it either. You don't want to bake in slop that throws off the next LLM (or person).
Somebody pointed out that those Markdown files might be helpful for people to read directly. Bit of an Emperor's new clothes moment. (I wanted to slap a :rolling_on_the_floor_laughing: reaction on it, but sadly it turns out I'm actually too chickenshit to do that in today's job market.)
In fact, the only area I've been struggling with are "Concepts" because they have less clear boundaries for the right amount of detail.
Here is what I've been working on: https://github.com/super-productivity/super-productivity/wik...
Almost sounds like an O'Reilly book
Matthew B. Doar (2011). Practical JIRA Plugins. O’Reilly.
https://www.oreilly.com/library/view/practical-jira-plugins/...
In case anyone was wondering. Which they probably weren’t :p
We built isagent.dev for exactly this reason: serve human content to humans, serve agent-optimized content to agents.
Generative AI wasn't a thing at the time, but I had to resort to a combination of OCR, simulated user input, and print capture to drive the application and export data.
Had the developers been aware of the Windows DRM APIs that block screen capture, or the fact that text is easily recoverable from PostScript files with minimal formatting, I don't know what I would have done.
The irony is that the process this replaced involved giving cheap offshore labor full read-only remote access to all data in the system, which was by any measure a far more serious security risk than authorized employees automating their own work with locally running, no-network-access tools from established, trustworthy vendors.
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`
Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
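If you're curious what "accessibility as a DOM" looks like in practice, here's a rough sketch that walks an app's AX tree on macOS. It assumes pyobjc's ApplicationServices bindings are installed and is not invoke's actual code, just the general shape of the API:

```python
# Sketch: dump an app's accessibility tree on macOS ("a good DOM in general").
# Assumes pyobjc (pip install pyobjc); not invoke's actual implementation.
import sys
from AppKit import NSWorkspace
from ApplicationServices import (
    AXIsProcessTrusted,
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
)

def attr(element, name):
    # pyobjc returns (error_code, value) for CoreFoundation-style out-params
    err, value = AXUIElementCopyAttributeValue(element, name, None)
    return value if err == 0 else None

def dump(element, depth=0):
    role = attr(element, "AXRole")
    title = attr(element, "AXTitle")
    print("  " * depth + f"{role or '?'} {title or ''}".rstrip())
    for child in attr(element, "AXChildren") or []:
        dump(child, depth + 1)

if __name__ == "__main__":
    if not AXIsProcessTrusted():
        sys.exit("Grant this terminal Accessibility permission first.")
    name = sys.argv[1] if len(sys.argv) > 1 else "Finder"
    app = next((a for a in NSWorkspace.sharedWorkspace().runningApplications()
                if a.localizedName() == name), None)
    if app is None:
        sys.exit(f"{name} is not running")
    dump(AXUIElementCreateApplication(app.processIdentifier()))
```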
One thing I am curious about is a hybrid approach where LLMs work in conjunction with vision models (and probes which can query/manipulate the DOM) to generate Playwright code which wraps browser access to the site in a local, programmable API. Then you'd have agents use that API to access the site rather than going through the vision agents for everything.
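Concretely, the "local, programmable API" I'm imagining is just a generated Playwright wrapper like this sketch; the URL and selectors are made up:

```python
# Sketch: wrap one site flow as a plain function that other agents can call,
# instead of re-deriving the clicks via vision every time.
# URL and selectors are made-up placeholders.
from playwright.sync_api import sync_playwright

def search_products(query: str) -> list[str]:
    """Run a search on the (hypothetical) storefront and return result titles."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://shop.example.com")
        page.fill("input[name=q]", query)
        page.press("input[name=q]", "Enter")
        page.wait_for_selector(".result-title")
        titles = page.locator(".result-title").all_inner_texts()
        browser.close()
        return titles
```

Agents would then call search_products() directly, falling back to the vision path only when no wrapper exists yet.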
https://playwright.dev/docs/getting-started-mcp#accessibilit...
I've mentioned several times, and gotten snarky remarks for it, that rewriting your code so it fits in your head and in the LLM's context helps the LLM code better. People complain about rewriting code "just for an LLM", not realizing that the suggestion is to follow better coding principles, which lets the LLM code better and has the net benefit of letting humans code better too! Well, it looks like if you support accessibility in your web apps correctly, Playwright MCP will work correctly for you.
Amazing.
Harder to scale if it's doing a lot of them, I suppose.
Most wikis you can mirror locally if you really need to hammer them.
and now the fact that interfaces need to be accessible to agents, not just humans, ironically increases accessibility for humans in return
I think this is very fertile ground - big labs need to use approaches that can work on multiple platforms and arbitrary workflows, and full-page vision is the lowest common denominator. Platform-specific approaches are a really exciting open space!
https://accessibilityinsights.io/
https://learn.microsoft.com/en-us/windows/win32/winauto/insp...
https://github.com/FlaUI/FlaUInspect
and for WPF applications specifically,
i so far haven't found any application that doesn't.
all you're able to get out, as far as i can tell, is the length of the entered password.
https://devblogs.microsoft.com/cppblog/spy-internals/
Obviously, if you can inject code into a process that receives sensitive data, you're already running in a context where all security bets are off.
But with processes you yourself create, you probably can inject code, even without elevated privileges, unless the application takes measures to prevent it (akin to game anti-cheat mechanisms). So it seems worth pointing out that there are simple mechanisms to subvert such "protected" fields that don't require application-specific reverse engineering.
Now, the argument against this on [reddit](https://www.reddit.com/r/openclaw/comments/1s1dzxq/comment/o...):
"my experience is the opposite actually. UIA looks uniform on paper but WPF, WinForms, and Win32 all expose different control patterns and you end up writing per-toolkit handlers anyway. Qt only exposes anything if QAccessible was compiled in and the accessibility plugin is loaded at runtime, which on shipped binaries is basically never. Electron is just as opaque on Windows as on macOS because it's the same chromium underneath drawing into a canvas. the real split isn't OS vs OS, it's native toolkit vs everything else."
i tend to think of invoke as "an API over macOS apps" tho...
doesn't `invoke finder shareAndCopyLink` read very nicely? :P
in the context of this blog post, the conclusion looks similar though!
"use the whole web like it's an API"
works much better than
"figure out similar or identical tasks from a clean slate every single time you do them"
invoke rather has overlap with Claude's and Codex' computer-use, except the steps are stored/scripted.
webmcp is bottom-up. computer-use & invoke are top-down
_of course_ computer use is worse. It is your last resort. Do not use it on state that lives in a DB that you own.
If anything I am impressed that it’s only 50x worse.
If I think an LLM is good for something, I create well-defined, very deterministic "middleware" for that purpose on top of OpenRouter.
Anthropic even says that an agent-based solution should only be your last resort and that most problems are well served with a one-shot.
https://www.anthropic.com/engineering/building-effective-age...
I'm much more on board with that type of LLM workflow. Running "agents" with a monolithic "harness" for long-time-horizon tasks seems wasteful and unnecessary, but probably super appealing to lazy people.
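For what it's worth, by "middleware" I mean something as boring as this sketch: one narrow, typed function per task, one-shot, no agent loop. It assumes OpenRouter's OpenAI-compatible endpoint; the model name is a placeholder:

```python
# Rough sketch of deterministic "middleware" on top of OpenRouter:
# one narrow function per task, one-shot, temperature 0, no agent loop.
# Uses OpenRouter's OpenAI-compatible endpoint; model name is a placeholder.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def classify_ticket(text: str) -> str:
    """Return exactly one of: billing, bug, feature, other."""
    resp = client.chat.completions.create(
        model="meta-llama/llama-3.1-70b-instruct",  # placeholder model
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the support ticket. Reply with one word: "
                        "billing, bug, feature, or other."},
            {"role": "user", "content": text},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    # Constrain the output so callers never see free-form text.
    return label if label in {"billing", "bug", "feature", "other"} else "other"
```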
Agent use can improve quality and maintainability
If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and another agent is then given that description, would that second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?
With an example UI I made up, the description (API-like interface definition) could be something like:
Get all reviews:
To get all the reviews you need to go to each page and click "show full review" for every review summary on that page.
Go to each page:
Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).
So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment. Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
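To make the hand-off concrete, the exploring agent's description could be frozen into a stored skill roughly like this (my own sketch, with invented selectors and Playwright assumed as the harness, not anything from the article):

```python
# Sketch: the "skill" the exploring agent might emit, frozen into code the
# second agent can run without re-discovering the UI. Selectors are invented.
from playwright.sync_api import Page

def get_all_reviews(page: Page) -> list[str]:
    """Follow the recorded notes: expand every review on every page."""
    reviews = []
    page.click("text=Reviews")                      # start on the Reviews tab
    while True:
        for button in page.locator("text=show full review").all():
            button.click()                          # expand each summary
        reviews += page.locator(".review-body").all_inner_texts()
        next_button = page.locator("button:has-text('next')")
        if next_button.count() == 0 or not next_button.is_enabled():
            break                                   # last page reached
        next_button.click()
    return reviews
```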
For better and worse, 5-10 MiB isn't uncommon for a web app.
Instead of trying to go "bottom up" and, effectively, do what a browser engine is doing in reverse, it seems easier to go "top down" like a human does and go off the visual representation.
No, most vision models focus on a subset of an image at a time when doing image -> text.
Image -> image uses the whole image.
My core idea was that the "fast" perception loop is fully local and GPU-optimised for UI tokenisation and change detection. The "slow" control loop requires an LLM roundtrip and uses a token-efficient markdown interface in the CLI output.
It uses relatively stable identifiers for controls, so agents can script common actions, eg `desktopctl pointer click --id btn_save` doesn't require UI tokenisation loop.
The best GUIs make great use of muscle memory, which makes them perfect candidates for scripting via CLI. eg a simple sequence "open Notes app, hit Cmd+F, enter search term, read list of results" can be one Bash command invoked by AI agent.
> build temp housing for it
everyone knows the real trouble starts when the monkey asks for the vote
I don't think many realize how cheap the alternative models are becoming. I prefer SOTA models for key work, but I can also spend 10X as many tokens on an open model hosted by a non-VC-subsidized provider (who is selling at a profit) for tasks that can tolerate slightly less quality.
The situation is only getting better as models improve and data centers get built out.
Bedrock isn't the cheapest either, although I'm fairly sure they aren't being VC-subsidized.
There are definitely cheap tokens out there. The big gotcha is "for tasks that can tolerate slightly less quality"
I think everyone claiming that inference is getting more expensive is unaware that there are more LLM providers than Google, Anthropic, and OpenAI.
Face-scanning? Iris patterns?
https://www.google.com/search?q=identify+anonymous+visa+mast...
Try the exorbitant expense and ballooning waste of generated electricity and usable water.