Posted by serjester 3/31/2025

AI agents: Less capability, more reliability, please(www.sergey.fyi)
423 points | 253 comments | page 3
tristor 3/31/2025|
The thing I most want an AI agent to do is something I can't trust to any third party: it would need to be local, and it's well within LLM capabilities today. I just want a "secretary in my pocket" to take notes during conversations and produce minutes, but do so in a way that's secure and privacy-respecting (e.g. I can use it at work or at home).
htrp 4/1/2025|
Get a Pixel or S24 with on-device ASR and summaries.
kuil009 3/31/2025||
It's natural to expect reliability from AI agents — but I don't think Cursor is a fair example. It's a developer tool deeply integrated with git, where every action can have serious consequences, as in any software development context.

Rather than blaming the agent, we should recognize that this behavior is expected. It’s not that AI is uniquely flawed — it's that we're automating a class of human communication problems that already exist.

This is less about broken tools and more about adjusting our expectations. Just like hunters had to learn how to manage gunpowder weapons after using bows, we’re now figuring out how to responsibly wield this new power.

After all, when something works exactly as intended, we already have a word for that: software.

bigfishrunning 3/31/2025|
Lol, software is a field that pretty severely lacks rigor -- if software is "something that works exactly as intended", then you've had a very different experience in this industry than I have.
kuil009 3/31/2025||
Garbage in, garbage out. Like it or not, even someone’s trashy intentions can run exactly as designed — so I guess we’ve had the same experience.
ankit219 3/31/2025||
Agents in their current format are unlikely to go beyond current levels of reliability. I believe agents are a good fit for low-trust environments (outside of coding, where you can see the errors quickly through testing or deployment), like inter-company communications and tasks, where there are already systems in place for checks and for when things go wrong. That might be a hot space in some time. For intra-company, high-trust environments, this can't just be workflow automation, since any error would force the knowledge worker to redo the whole task to check whether it's correct. We can verify via other agents - less chance of things going wrong, but more chance the checker screws up in the same place as the previous one.
rambambram 3/31/2025||
We hear you, so we decided to tweak the dials a bit. The dial for 'capability' we can turn back a little, no problem, but the dial for 'reliability', uhm yeah... I'm sorry, but we couldn't find that dial. Sorry.
killjoywashere 3/31/2025||
We have been looking at Hamming distance vs. time to signature for ambient note generation in medicine. Any other metrics? There are lots of metrics in the ML papers, but a lot of them seem sus: they take a lot of work to reproduce, or they're designed around some strategy like maxing out the easy true negatives (so you get desirable accuracy and F1 scores), etc. As someone trying to build validation protocols that I can get vendors to enable (we need them to write certain data from memory to a DB table we can access), I'd welcome that discussion. Right now the MBAs running the hospital systems are doing whatever their ML buddies say, without regard to patient or provider.
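(To make that first metric concrete, a minimal sketch in Python; the names and the token-level variant of Hamming distance are assumptions of this sketch, not the poster's actual protocol.)

    # Sketch of a draft-vs-signed comparison: token-level Hamming-style
    # distance plus time to signature. Assumed names, not a real protocol.
    from datetime import datetime

    def token_hamming(draft: str, signed: str) -> int:
        """Count aligned token positions where the two notes differ;
        tokens past the shorter note each count as one edit."""
        a, b = draft.split(), signed.split()
        return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

    def time_to_signature(drafted_at: datetime, signed_at: datetime) -> float:
        """Seconds between AI draft generation and clinician sign-off."""
        return (signed_at - drafted_at).total_seconds()

Plotting these two values per note gives the distance-vs-time view described above, and both are the kind of data a vendor could write to the shared DB table the poster asks for.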
janalsncm 3/31/2025||
I think many people share the same sentiment. We don’t need agents that can kind of do many things. We need reliable programs that are really good at doing a single thing. I said as much about Manus when it came out.

https://news.ycombinator.com/item?id=43350950

There are mistakes in the Manus demo if you actually look at it. As with so many AI demos, they never want you to look too closely, because the thing that was created is fairly mediocre. No one is asking for this tsunami of sludge, except for VCs apparently.

cryptoz 3/31/2025||
This is refreshing to read. I, like everyone apparently, am working on my own coding agent [1]. And I suppose it's not that capable yet, but it sure is getting more reliable. I have it modify only one file at a time. It generates tickets for itself to complete - but never enough tickets to really get all the work done. The tickets it does generate, however, it can often complete (at least in simple cases, haha). The file modification is done by parsing ASTs and modifying those, so the AI doesn't go off and do all kinds of things to your whole codebase.

And I'm so sick of everything trying for 100% automation and failing. There's a place for the human in the loop, in quickly identifying bugs the AI doesn't have the context for, or large-scale vision, or security or product-focused mindset, etc.

It's going to be AI and humans collaborating. The solutions that figure that out the best are going to win IMO. AI won't be doing everything and humans won't be doing it all either. The tools with the best human-AI collaboration are where it's at.

[1] https://codeplusequalsai.com

helltone 3/31/2025|
How do you modify ASTs?
cryptoz 3/31/2025||
I support HTML, JS, Python and CSS. For HTML (not technically an AST), I give the LLM the original HTML source, and then I instruct it to write Python code that uses BeautifulSoup to modify the HTML. Then I get back from Python the string of the full HTML file, modified according to the user prompt.
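(A rough illustration of that HTML path, as a minimal sketch: the concrete edit below is invented, not taken from the actual codeplusequalsai pipeline.)

    # The kind of BeautifulSoup script the LLM is asked to emit.
    from bs4 import BeautifulSoup

    original_html = "<html><body><h1>Hello</h1></body></html>"
    soup = BeautifulSoup(original_html, "html.parser")

    # Example user-prompted edit: retitle the heading, add a tagline.
    heading = soup.find("h1")
    heading.string = "Hello, world"
    tagline = soup.new_tag("p")
    tagline.string = "Edited via a generated BeautifulSoup script."
    heading.insert_after(tagline)

    # The full modified document goes back to the agent as a string.
    print(str(soup))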

For Python changes I use the ast and astor packages, for JS I use esprima/escodegen/estraverse, and for CSS I use postcss. The process is the same for each one: I give the original input source file, and I instruct the LLM to write code that parses the file into AST form and then modifies that AST.
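(And a minimal sketch of that ast/astor round trip for the Python path; the rename transform is an invented example, not from the project.)

    # ast -> transform -> astor round trip, as described above.
    import ast
    import astor

    source = "def greet(name):\n    print('hi', name)\n"

    class RenameGreet(ast.NodeTransformer):
        """Invented example transform: rename greet() to welcome()."""
        def visit_FunctionDef(self, node):
            if node.name == "greet":
                node.name = "welcome"
            return self.generic_visit(node)

    tree = ast.parse(source)
    tree = RenameGreet().visit(tree)
    print(astor.to_source(tree))  # regenerated source with the edit applied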

I blogged about it here if you want more details! https://codeplusequalsai.com/static/blog/prompting_llms_to_m...

skydhash 3/31/2025||
I took a look at your project and while it's nice (technically), for the actual use case shown, I can't see the value over something like the old Dreamweaver with a bit of training.

I still think prompting is the wrong interface for programming systems. Even though they're restricted, configuration forms, visual programming with nodes, and small scripts attached to objects on a platform are way more reliable and useful.

cryptoz 3/31/2025||
Appreciate you having a look and for that feedback, thanks - I do agree I have work to do to prove that my idea is better than alternatives. We'll see...
whatnow37373 4/1/2025||
Agents introduce causality, reflection, necessity and various other sub-components never to be found in purely stochastic completion engines. This is an improvement, but it does require breaking down what each "agent" needs to do. What are the "core components" of cognition?

That's why I claim that any sufficiently complicated cognitive architecture contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Immanuel Kant's work.

wg0 3/31/2025||
Totally agree with the author here. Also, reliability is pretty hard to achieve when the underlying models are all mountains of probability: no one yet understands exactly how they do what they do, or how to precisely fix a problem without affecting other parts.

Here's CNBC Business pushing the greed narrative that these aren't AI wrappers but the next best thing after fire, bread and the axe [0].

[0]. https://youtu.be/mmws6Oqtq9o

freeamz 3/31/2025|
The same can be said about digital tech/infrastructure in general!
wg0 3/31/2025||
I can't say that based on what I know about both.
Havoc 4/1/2025|
What has me slightly puzzled is why there isn’t a sharp pivot towards typed languages for vibe coding.

It would be much easier for the AI/IDE to confirm the code is likely good, or at least better than untyped. The whole Rust "if it compiles, it probably works" thing.

Instead it’s all python/JS let LLM write code and pray you don’t hit run time errors on a novel code path

I get that there's more Python training data, but it still seems like the inferior fit for LLM-assisted coding.
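(To illustrate the point in Python itself, since that's what these agents mostly emit: with type annotations, a checker like mypy catches this invented bug statically; untyped, it only explodes at runtime on that code path.)

    # Invented toy example, not from the thread.
    def apply_discount(price: float, pct: float) -> float:
        """Return price reduced by pct percent."""
        return price * (1 - pct / 100)

    # An LLM passing a string where a number is expected is a TypeError
    # only when this path actually runs...
    apply_discount("19.99", 10)
    # ...but mypy reports it before anything executes:
    #   error: Argument 1 to "apply_discount" has incompatible type
    #   "str"; expected "float"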
