Posted by serjester 4 days ago
+100 on the footnote:
> agents or workflows?
Workflows. Workflows, all the way.
The agents can start using these workflows once they are actually ready to execute stuff with high precision. And by then we would have figured out how to create effective, accurate, and easily diagnosable workflows, so people will stop complaining about "I want to know what's going on inside the black box".
99.9% of real world enterprise AI use cases today are for workflows not agents.
However, "agents" are being pushed because the industry needs a next big thing to keep the investment funding flowing in.
The problem is that even the best reasoning models available today don't have the actual reasoning and planning capability needed to build truly autonomous agents. They might in a year. Or they might not.
And are there any guidelines on how to manage workflows for a project or set of projects? I’m just keeping them in plain text and including them in conversations ad hoc.
That being said, back in February I was trying out a bunch of AI personal assistant apps/tools. I found, without fail, every single one of them was advertising features their LLMs could theoretically accomplish, but in practice couldn't. Even worse, many of these "assistants" would proactively suggest they could accomplish something, but when you sent them out to do it, they'd tell you they couldn't.
* "Would you like me to call that restaurant?"...."Sorry, I don't have support for that yet"
* "Would you like me to create a reminder?"....Created the reminder, but never executed it
* "Do you want me to check their website?"...."Sorry, I don't support that yet"
Of all of the promised features, the only thing I ended up using any of them for was a text message interface to an LLM. Now that Siri has native ChatGPT support, even that's no longer necessary.
It's not like there's a lever in Cursor HQ where one side is "Capability" and one side is "Reliability", and they can make things better just by tipping it back towards the latter.
You can bias designs and efforts in that direction, and get your tool to output reversible steps or bake in sanity checks to blessed actions, but that doesn't change the nature of the problem.
It's like saying rm -rf / should have more safeguards built in. It feels unfair to call out the AI-based tools for this.
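To be concrete about what "bake in sanity checks to blessed actions" can look like in practice, here's a toy sketch in Python. The destructive-command patterns and the confirm flow are my own illustration, not how Cursor or any particular tool actually does it:

```python
# Toy sketch of gating agent-proposed shell commands: benign ones run,
# anything matching a "destructive" pattern waits for human confirmation.
# Patterns and flow are illustrative, not any real tool's implementation.
import re
import subprocess

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",              # recursive force delete
    r"\bgit\s+push\s+--force\b",  # history rewrite on a shared remote
    r"\bdrop\s+table\b",          # destructive SQL
]

def is_destructive(command: str) -> bool:
    return any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)

def run_agent_command(command: str) -> None:
    """Run a command proposed by an agent, pausing for confirmation
    when it matches a destructive pattern."""
    if is_destructive(command):
        answer = input(f"Agent wants to run:\n  {command}\nProceed? [y/N] ")
        if answer.strip().lower() != "y":
            print("Skipped.")
            return
    subprocess.run(command, shell=True, check=True)

# A benign command runs straight through; rm -rf waits for a human.
run_agent_command("git status")
```

You can always add more gates like this, but as said above, that biases the tool toward reliability rather than changing the nature of the problem.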
* "unreliable" may not be the right word. For all we know, the agent performed admirably given whatever the user's prompt may have been. Just goes to show that even in a relatively constricted domain of programming, where a lot (but far from all) outcomes are binary, the room for misinterpretation and error is still quite vast.
Any system capable of automating a complex task will by necessity be more complex than the task at hand. That complexity doesn't evaporate when you throw statistical fuzzers at it.
> For example, if a user with appropriate privileges mistakenly runs ‘rm -rf / tmp/junk’, that may remove all files on the entire system. Since there are so few legitimate uses for such a command, GNU rm normally declines to operate on any directory that resolves to /. If you really want to try to remove all the files on your system, you can use the --no-preserve-root option, but the default behavior, specified by the --preserve-root option, is safer for most purposes.
https://www.gnu.org/software/coreutils/manual/html_node/Trea...
I hypothesize that a $(git fetch --mirror) would pull down the "orphaned" revision, too, but don't currently have the mental energy to prove it
I tend to think that what this article is asking for isn't achievable, because what people mean by "AI" is precisely "we don't know how it works".
An analogy I've used sometimes when talking with people about AI is the "I know a guy" situation. Someone you know comes and tells you "I know a guy who can do X for you", where "do X" is "write your class paper" or "book a flight" or "describe what a supernova is" or "invest your life savings". In this situation, the more important the task, the more you would probably want to know about this "guy". What are his credentials? Has he done this before? How often has he failed? What were the consequences? Can he be trusted? Etc.
The thing that "a guy" and an AI have in common is that you don't know what they're doing. Where they differ is in your ability to gradually gain knowledge. In real life, "know a guy" situations become transformed into something more specific as you gain information about who the person is and how they do what they do, and especially as you understand more about the system of consequences in which they are embedded (e.g., "if this painter had ruined many people's houses he would have been sued into oblivion, or at least I would have heard about it"). And also real people are unavoidably embedded in the system of physical reality which imposes certain constraints that bound plausibility (e.g., if someone tells you "I know a guy who can paint your entire house in five seconds" you will smell a rat).
Asking for "reliability" means asking for a network of causes and effects that surrounds and supports whatever "guy" or AI you're relying on. At this point I don't see any mechanism to provide that other than social and ultimately legal pressure, and I don't see any strong action being taken in that direction.
I've started taking a very data engineering-centric approach to the problem where you treat an LLM as an API call as you would any other tool in a pipeline, and it's crazy (or maybe not so crazy) what LLM workflows are capable of doing, all with increased reliability. So much so that I've tried to package my thoughts / opinions up into an AI SDK for Apache Airflow [1] (one of the more popular orchestration tools that data engineers use). This feels like the right approach and in our customer base / community, it also maps perfectly to the organizations that have been most successful. The number of times I've seen companies stand up an AI team without really understanding _what problem they want to solve_...
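To give a rough idea of what I mean, here's a minimal TaskFlow-style sketch. To be clear, this is not the SDK from [1]; the task names, the OpenAI client, and the model name are just placeholders for whatever source, LLM, and sink you actually use. The point is that the LLM call is one step in a DAG, so retries, timeouts, and lineage come from the orchestrator:

```python
# Minimal sketch: an LLM call as just another task in an Airflow pipeline.
from datetime import timedelta

import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def summarize_tickets():
    @task
    def extract() -> list[str]:
        # Placeholder source; in practice a warehouse query or API pull.
        return ["Customer cannot reset password", "Billing page times out"]

    @task(retries=3, retry_delay=timedelta(seconds=30))
    def classify(tickets: list[str]) -> list[dict]:
        # Hypothetical LLM call; swap in whatever client/model you use.
        from openai import OpenAI
        client = OpenAI()
        results = []
        for t in tickets:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": f"Label this ticket as 'auth' or 'billing': {t}",
                }],
            )
            results.append({"ticket": t, "label": resp.choices[0].message.content})
        return results

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder sink; in practice write to a table or downstream system.
        print(rows)

    load(classify(extract()))

summarize_tickets()
```

Because the LLM sits behind a plain task boundary, you get retries on flaky responses, observability per run, and a clear place to validate outputs before they move downstream.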
The best companies can get up to 90% accuracy. Most are closer to 80%.
But it's important to remember that we're expecting perfection here. Think about this: have you ever asked someone to book a flight for you? How did it go?
At least in my experience, there are usually a few back-and-forth emails, and then something is always not quite right or as good as if you'd done it yourself, but you're OK with that because it saved you time. The one thing that makes it better is if the same person does it for you a couple of times and learns your specific habits and what you care about.
I think the biggest problem in AI accuracy is expecting the AI to be better than a human.
If it's not better across at least one of {more accurate, faster, cheaper} then there is no business. You have to be offering one of the above.
And that applies both to humans and to existing tech solutions: an LLM solution must beat both in some dimension. Current flight booking interfaces are actually better than a human at all three: they're more accurate, they're free, and they're faster than trying to do the back and forth, which means the bar to clear for an agent is extremely high.
Only when you know exactly where to go. If you need to get to customers in 3 cities where order doesn't matter (i.e. the traveling salesman problem, though you are allowed to hit any city more than once), current solutions are not great. And if you want to go on vacation but don't care much about where (almost every place with an airport would be an acceptable vacation, though some are better than others), they don't help much either.
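To make the multi-city case concrete, here's a toy sketch of the enumeration a traveler ends up doing by hand today. The airports and fares are made up, and a real version would query live prices rather than a hard-coded table:

```python
# Toy sketch: pick the cheapest visiting order for 3 customer cities,
# starting and ending at home. Fares are made-up placeholders.
from itertools import permutations

HOME = "SEA"
CITIES = ["DEN", "ORD", "ATL"]

# Hypothetical one-way fares between airports (symmetric for brevity).
FARES = {
    ("SEA", "DEN"): 120, ("SEA", "ORD"): 180, ("SEA", "ATL"): 220,
    ("DEN", "ORD"): 90,  ("DEN", "ATL"): 140, ("ORD", "ATL"): 110,
}

def fare(a: str, b: str) -> int:
    return FARES.get((a, b)) or FARES[(b, a)]

def trip_cost(order: tuple[str, ...]) -> int:
    stops = (HOME, *order, HOME)
    return sum(fare(a, b) for a, b in zip(stops, stops[1:]))

best = min(permutations(CITIES), key=trip_cost)
print(best, trip_cost(best))
```

Nothing in a typical booking interface does this for you, which is exactly the kind of open-ended trip where an agent could in principle add value.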
I personally struggle to find a new one (AI agent coding assistants already exist, and of course I'm excited about them, especially as they get better). I will not, any time soon, trust unsupervised AI to send emails on my behalf, make travel reservations, or perform other actions that are very costly to fix. AI as a shopping agent just isn't too exciting for me, since I do not believe I actually know what features in a speaker / laptop / car I want until I do my own research by reading what experts and users say.
Transparency? If it worked, even unreliably, nobody would care what it does. The problem is that stochastic machines aren't engineers, don't reason, and aren't intelligent.
I find articles that attack AI but blame some mouse rather than pointing at the elephant in the room exhausting.