Posted by serjester 4 days ago
+100 on the footnote:
> agents or workflows?
Workflows. Workflows, all the way.
The agents can start using these workflows once they are actually ready to execute stuff with high precision. And by then we would have figured out how to create effective, accurate, and easily diagnosable workflows, so people will stop complaining about "I want to know what's going on inside the black box".
99.9% of real world enterprise AI use cases today are for workflows not agents.
However, "agents" are being pushed because the industry needs a next big thing to keep the investment funding flowing in.
The problem is that even the best reasoning models available today don't have the actual reasoning and planning capability needed to build truly autonomous agents. They might in a year. Or they might not.
And are there any guidelines on how to manage workflows for a project or set of projects? I’m just keeping them in plain text and including them in conversations ad hoc.
That being said, back in February I was trying out a bunch of AI personal assistant apps/tools. I found, without fail, every single one of them was advertising features their LLMs could theoretically accomplish, but in practice couldn't. Even worse, many of these "assistants" would proactively suggest they could accomplish something, but when you sent them out to do it, they'd tell you they couldn't.
* "Would you like me to call that restaurant?"...."Sorry, I don't have support for that yet"
* "Would you like me to create a reminder?"....Created the reminder, but never executed it
* "Do you want me to check their website?"...."Sorry, I don't support that yet"
Of all of the promised features, the only thing I ended up using any of them for was a text message interface to an LLM. Now that Siri has native ChatGPT support, even that's no longer necessary.
It's not like there's a lever in Cursor HQ where one side is "Capability" and one side is "Reliability", and they can make things better just by tipping it back towards the latter.
You can bias designs and efforts in that direction, and get your tool to output reversible steps or bake in sanity checks to blessed actions, but that doesn't change the nature of the problem.
It's like saying rm -rf / should have more safeguards built in. It feels unfair to call out the AI-based tools for this.
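To be concrete about what "bake in sanity checks to blessed actions" can look like in practice, here's a toy sketch in Python. The destructive-command patterns and the confirm flow are my own illustration, not how Cursor or any particular tool actually does it:

```python
# Toy sketch of gating agent-proposed shell commands: benign ones run,
# anything matching a "destructive" pattern waits for human confirmation.
# Patterns and flow are illustrative, not any real tool's implementation.
import re
import subprocess

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",              # recursive force delete
    r"\bgit\s+push\s+--force\b",  # history rewrite on a shared remote
    r"\bdrop\s+table\b",          # destructive SQL
]

def is_destructive(command: str) -> bool:
    return any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)

def run_agent_command(command: str) -> None:
    """Run a command proposed by an agent, pausing for confirmation
    when it matches a destructive pattern."""
    if is_destructive(command):
        answer = input(f"Agent wants to run:\n  {command}\nProceed? [y/N] ")
        if answer.strip().lower() != "y":
            print("Skipped.")
            return
    subprocess.run(command, shell=True, check=True)

# A benign command runs straight through; rm -rf waits for a human.
run_agent_command("git status")
```

You can always add more gates like this, but as said above, that biases the tool toward reliability rather than changing the nature of the problem.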
* "unreliable" may not be the right word. For all we know, the agent performed admirably given whatever the user's prompt may have been. Just goes to show that even in a relatively constricted domain of programming, where a lot (but far from all) outcomes are binary, the room for misinterpretation and error is still quite vast.
Any system capable of automating a complex task will by necessity be more complex than the task at hand. That complexity doesn't evaporate when you throw statistical fuzzers at it.
> For example, if a user with appropriate privileges mistakenly runs ‘rm -rf / tmp/junk’, that may remove all files on the entire system. Since there are so few legitimate uses for such a command, GNU rm normally declines to operate on any directory that resolves to /. If you really want to try to remove all the files on your system, you can use the --no-preserve-root option, but the default behavior, specified by the --preserve-root option, is safer for most purposes.
https://www.gnu.org/software/coreutils/manual/html_node/Trea...
I hypothesize that a $(git fetch --mirror) would pull down the "orphaned" revision, too, but don't currently have the mental energy to prove it
I tend to think that what this article is asking for isn't achievable, because what people mean by "AI" is precisely "we don't know how it works".
An analogy I've used sometimes when talking with people about AI is the "I know a guy" situation. Someone you know comes and tells you "I know a guy who can do X for you", where "do X" is "write your class paper" or "book a flight" or "describe what a supernova is" or "invest your life savings". In this situation, the more important the task, the more you would probably want to know about this "guy". What are his credentials? Has he done this before? How often has he failed? What were the consequences? Can he be trusted? Etc.
The thing that "a guy" and an AI have in common is that you don't know what they're doing. Where they differ is in your ability to gradually gain knowledge. In real life, "know a guy" situations become transformed into something more specific as you gain information about who the person is and how they do what they do, and especially as you understand more about the system of consequences in which they are embedded (e.g., "if this painter had ruined many people's houses he would have been sued into oblivion, or at least I would have heard about it"). And also real people are unavoidably embedded in the system of physical reality which imposes certain constraints that bound plausibility (e.g., if someone tells you "I know a guy who can paint your entire house in five seconds" you will smell a rat).
Asking for "reliability" means asking for a network of causes and effects that surrounds and supports whatever "guy" or AI you're relying on. At this point I don't see any mechanism to provide that other than social and ultimately legal pressure, and I don't see any strong action being taken in that direction.
I've started taking a very data engineering-centric approach to the problem where you treat an LLM as an API call as you would any other tool in a pipeline, and it's crazy (or maybe not so crazy) what LLM workflows are capable of doing, all with increased reliability. So much so that I've tried to package my thoughts / opinions up into an AI SDK for Apache Airflow [1] (one of the more popular orchestration tools that data engineers use). This feels like the right approach and in our customer base / community, it also maps perfectly to the organizations that have been most successful. The number of times I've seen companies stand up an AI team without really understanding _what problem they want to solve_...
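To give a rough idea of what I mean, here's a minimal TaskFlow-style sketch. To be clear, this is not the SDK from [1]; the task names, the OpenAI client, and the model name are just placeholders for whatever source, LLM, and sink you actually use. The point is that the LLM call is one step in a DAG, so retries, timeouts, and lineage come from the orchestrator:

```python
# Minimal sketch: an LLM call as just another task in an Airflow pipeline.
from datetime import timedelta

import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def summarize_tickets():
    @task
    def extract() -> list[str]:
        # Placeholder source; in practice a warehouse query or API pull.
        return ["Customer cannot reset password", "Billing page times out"]

    @task(retries=3, retry_delay=timedelta(seconds=30))
    def classify(tickets: list[str]) -> list[dict]:
        # Hypothetical LLM call; swap in whatever client/model you use.
        from openai import OpenAI
        client = OpenAI()
        results = []
        for t in tickets:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": f"Label this ticket as 'auth' or 'billing': {t}",
                }],
            )
            results.append({"ticket": t, "label": resp.choices[0].message.content})
        return results

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder sink; in practice write to a table or downstream system.
        print(rows)

    load(classify(extract()))

summarize_tickets()
```

Because the LLM sits behind a plain task boundary, you get retries on flaky responses, observability per run, and a clear place to validate outputs before they move downstream.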
The best companies can get up to 90% accuracy. Most are closer to 80%.
But it's important to remember that we're expecting perfection here. Think about this: have you ever asked someone to book a flight for you? How did it go?
At least in my experience, there are usually a few back-and-forth emails, and then something is always not quite right or as good as if you'd done it yourself, but you're OK with that because it saved you time. The one thing that makes it better is if the same person does it for you a couple of times and learns your specific habits and what you care about.
I think the biggest problem in AI accuracy is expecting the AI to be better than a human.
If it's not better across at least one of {more accurate, faster, cheaper} then there is no business. You have to be offering one of the above.
And that applies both to humans and to existing tech solutions: an LLM solution must beat both in some dimension. Current flight booking interfaces are actually better than a human at all three: they're more accurate, they're free, and they're faster than trying to do the back and forth, which means the bar to clear for an agent is extremely high.
Only when you know exactly where to go. If you need to get to customers in 3 cities where order doesn't matter (i.e. the traveling salesman problem, though you are allowed to hit any city more than once), current solutions are not great. And if you want to go on vacation but don't care much about where (almost every place with an airport would be an acceptable vacation, though some are better than others), they don't help much either.
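To make the multi-city case concrete, here's a toy sketch of the enumeration a traveler ends up doing by hand today. The airports and fares are made up, and a real version would query live prices rather than a hard-coded table:

```python
# Toy sketch: pick the cheapest visiting order for 3 customer cities,
# starting and ending at home. Fares are made-up placeholders.
from itertools import permutations

HOME = "SEA"
CITIES = ["DEN", "ORD", "ATL"]

# Hypothetical one-way fares between airports (symmetric for brevity).
FARES = {
    ("SEA", "DEN"): 120, ("SEA", "ORD"): 180, ("SEA", "ATL"): 220,
    ("DEN", "ORD"): 90,  ("DEN", "ATL"): 140, ("ORD", "ATL"): 110,
}

def fare(a: str, b: str) -> int:
    return FARES.get((a, b)) or FARES[(b, a)]

def trip_cost(order: tuple[str, ...]) -> int:
    stops = (HOME, *order, HOME)
    return sum(fare(a, b) for a, b in zip(stops, stops[1:]))

best = min(permutations(CITIES), key=trip_cost)
print(best, trip_cost(best))
```

Nothing in a typical booking interface does this for you, which is exactly the kind of open-ended trip where an agent could in principle add value.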
I personally struggle to find a new one (AI agent coding assistants already exist, and of course I'm excited about them, especially as they get better). I will not, any time soon, trust unsupervised AI to send emails on my behalf, make travel reservations, or perform other actions that are very costly to fix. AI as a shopping agent just isn't too exciting for me, since I do not believe I actually know what features in a speaker / laptop / car I want until I do my own research by reading what experts and users say.
Transparency? If it worked, even unreliably, nobody would care what it does. The problem is that stochastic machines aren't engineers, don't reason, and aren't intelligent.
I find articles that attack AI but blame some mouse rather than pointing at the elephant in the room exhausting.