Agents need control flow, not more prompts

Posted by bsuh 18 hours ago

Agents need control flow, not more prompts(bsuh.bearblog.dev)

480 points | 238 comments

827a 14 hours ago|

1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.

This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4 IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but it seems to have the same problems.

We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism to them at the right place.
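A minimal sketch of a harness like this, assuming one markdown file per test case. The names `run_harness` and `test_one` and the result fields are hypothetical, not the commenter's actual code. The model call is injected as a function, so the loop, the ordering, and the result file are all handled by plain code.

```python
import json
from pathlib import Path
from typing import Callable


def run_harness(req_dir: str, test_one: Callable[[str], str], out_file: str) -> list[dict]:
    """Test every requirements file exactly once, in a fixed order."""
    results = []
    for path in sorted(Path(req_dir).glob("*.md")):
        try:
            # test_one is where the model would be triggered for one test case.
            verdict = test_one(path.read_text(encoding="utf-8"))
            results.append({"file": path.name, "verdict": verdict, "error": None})
        except Exception as exc:
            # A failure in one file is recorded and never triggers re-tests of others.
            results.append({"file": path.name, "verdict": None, "error": str(exc)})
    Path(out_file).write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

Because the file list comes from `sorted(glob(...))`, no file can be skipped or tested three times, which is the failure mode described above.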

DrewADesign 14 hours ago||

I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect the developer job market/pay scale, but I just don’t think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don’t), a bunch of investors would make them walk the plank.

I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.

cogman10 13 hours ago|||

> However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it.

I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break it up into smaller and easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.

Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?

devin 8 hours ago|||

And then you realize that what you’re using the smaller models for is ALSO decomposable and part of it is just a few if statements, and then you realize that for this feature you don’t actually need or want a model because the performance, reliability, reproducibility are cheaper and better for you and your users.

jimbokun 7 hours ago||

So you have the model write the if statements and put itself out of a job.

aleqs 5 hours ago||||

Indeed, I've been experimenting with agent workflows, for complicated tasks - where I essentially have a graph of agents with different roles/capabilities, including such things as breaking down complex tasks into simpler ones. There seems to be a point where a complex enough task is better performed by a group of cheaper agents/models than by one agent using one of the SOTA big models, in terms of both quality and cost.

tempest_ 11 hours ago|||

It is also interesting because you get people with very different use cases arguing about the effectiveness of various models but doing very different things with them.

It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and add a button connected to it on the front end vs. "Design me a YouTube". The smaller models can pretty dependably do the first and fall flat on the second.

pishpash 14 hours ago||||

Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.

user34283 12 hours ago|||

The designing and implementing of a code harness in your workflow can be as simple as running something like /skill-builder.

You prompt for what you want it to do, and it will write e.g. Python scripts as needed for the looping part, and for example use claude -p for the LLM call.

You can build this in 10 minutes.

I don’t use a cloud platform, so I can’t comment on that part. I’d say just run it on your own hardware, it’s probably cheaper too.

fny 6 hours ago|||

Secret: "compile" that orchestration prompt. Determinism is solved by turning prompts into code that can in turn run agents or run code or both.

Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
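A rough sketch of what such a helper file might look like. The function names `call_agent` and `run_tasks` are hypothetical, and the only thing assumed about `claude -p` is what this thread says: that it takes a prompt and prints a response. No other flags are assumed.

```python
import subprocess
from typing import Sequence


def call_agent(prompt: str, cmd: Sequence[str] = ("claude", "-p"), timeout: float = 600) -> str:
    """Run one non-interactive agent call and return its stdout.

    Assumes the command accepts the prompt as its final argument and
    prints the answer to stdout.
    """
    done = subprocess.run([*cmd, prompt], capture_output=True, text=True,
                          timeout=timeout, check=True)
    return done.stdout.strip()


def run_tasks(prompts: list[str], agent=call_agent) -> dict[str, str]:
    """The loop is plain code, so every prompt runs exactly once."""
    return {p: agent(p) for p in prompts}
```

The `cmd` parameter is only there so the helper can be pointed at a different CLI or a stub during testing.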

krzyk 1 hour ago||

Could you elaborate on what "compiling the orchestration prompt" means?

Frost1x 16 minutes ago|||

When you get some abstraction working you concretize it in something deterministic, or sort of “cache” that knowledge bit (aka write me a function, class, library, whatever). In the future, the nondeterministic path now has a deterministic piece to lean on as it explores the problem space. Rinse, repeat, and eventually you have a mostly deterministic system. Leave flexibility in the places where you need that nondeterminism.

throawayonthe 41 minutes ago|||

a guess but i think they mean take the orchestration prompt and prompt yet another llm to turn that prompt into code..?

throawayonthe 42 minutes ago|||

> But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.)

couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill

definitely a worse solution, non deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms) but could work in some cases

bob1029 11 hours ago|||

I saw a major uplift in performance after I combined tools like apply_patch with check_compilation & run_unit_tests. I still call the tool "apply_patch", but it now returns additional information about the build & tests if the patch succeeds. The agent went from ~80% success rate to what seems to be deterministic (so far). I don't bother to describe the compilation and unit testing processes in my prompts anymore. All I need to do is return the results of these things after something triggers them to run as a dependency.
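A minimal sketch of that tool shape, under the assumption that patching, compiling, and testing are each available as plain functions. The signatures here are hypothetical; only the tool names come from the comment.

```python
from typing import Callable


def apply_patch(
    patch: str,
    do_apply: Callable[[str], bool],
    check_compilation: Callable[[], tuple[bool, str]],
    run_unit_tests: Callable[[], tuple[bool, str]],
) -> dict:
    """Apply a patch; if it applies, also report build and test results."""
    report = {"applied": do_apply(patch), "build": None, "tests": None}
    if not report["applied"]:
        return report
    build_ok, build_log = check_compilation()
    report["build"] = {"ok": build_ok, "log": build_log}
    if build_ok:  # tests are only meaningful on a successful build
        tests_ok, tests_log = run_unit_tests()
        report["tests"] = {"ok": tests_ok, "log": tests_log}
    return report
```

The agent never has to remember to build or test; it sees the results as a side effect of the one tool it already calls.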

I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT5.4 base model.

woeirua 13 hours ago|||

I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.

rdedev 13 hours ago|||

I had to create a hypothesis testing agent where it gets a query like "is manufacturing parameter x significantly different this month than last month" and have the agent follow a flowchart to run a statistical test and return the answer

At the time I had access to only 4o and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
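One way such a loop could look, as a sketch. The flowchart contents, the step names, and `run_flowchart` are all made up for illustration; the point is only that the harness, not the agent, decides which step comes next.

```python
from typing import Callable

# Hypothetical flowchart: each step has an instruction and a map from the
# agent's answer to the next step (an empty map marks a terminal step).
FLOWCHART = {
    "check_normality": {
        "instruction": "Are both samples approximately normal? Answer yes or no.",
        "next": {"yes": "t_test", "no": "mann_whitney"},
    },
    "t_test": {"instruction": "Run a two-sample t-test and report the p-value.", "next": {}},
    "mann_whitney": {"instruction": "Run a Mann-Whitney U test and report the p-value.", "next": {}},
}


def run_flowchart(flowchart: dict, start: str, ask_agent: Callable[[str], str]) -> list[tuple[str, str]]:
    """Feed the agent one step at a time and return the (step, answer) trace."""
    trace, step = [], start
    while True:
        node = flowchart[step]
        answer = ask_agent(node["instruction"]).strip().lower()
        trace.append((step, answer))
        if not node["next"]:
            return trace  # terminal step reached
        if answer not in node["next"]:
            raise ValueError(f"unexpected answer {answer!r} at step {step!r}")
        step = node["next"][answer]
```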

julianlam 11 hours ago|||

> This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating.

Sorry, you thought a prompt was a suitable replacement for a testing suite?

zapataband1 10 hours ago|||

hey man it works great barely and also costs a bunch of money every time we run it. we also can't trust the results, relax.

deadbabe 8 hours ago|||

If you are invested in AI stocks, this is the way. You are basically funneling money from software companies into your brokerage account. Keep going.

andy12_ 2 hours ago|||

Isn't this already possible to implement with skills and subagents? Like have a skill saying "to test these files run this script that executes a subagent for every markdown file, then check the results".

mmis1000 14 hours ago|||

> This started breaking down after ~30 files.

Codex's short context and todo list system combined somehow help here, though. Because of the frequent compaction, the model was forced to recheck which todo list items had not been done yet and which workflow skill it had to use. I used to leave it for multiple hours to do a big cleanup, and it finished without obvious issues.

swores 14 hours ago||

Is Codex willing to do "multi hour" tasks when used with a ChatGPT Plus subscription, or does it need something more expensive like Pro?

dns_snek 4 hours ago|||

It's going to work the same regardless of how much you pay, but with Plus you'll run into 5h usage limit rather quickly unless your "multi hour task" spends 90% of the time just waiting around for code to compile. Expect to get an hour or two of active work (single-threaded).

shivnathtathe 1 hour ago||||

If you have any org email, you can get free chatgpt + subscription.

dnh44 13 hours ago|||

I regularly get codex to do multi hour tasks with a single prompt; I don't think that's a big deal anymore. But you don't want a single agent doing all the work. The root agent needs to delegate the work to sub agents. For example, a sub agent for context gathering, then one for planning, then one (or more) for implementation, then another for review. This way the root agent doesn't use up its context window and it just manages from a bird's eye view. I do have the $200 plan though.

jiehong 4 hours ago|||

This might be inherent to how the models are benchmarked.

Aren’t some benchmarks giving the model multiple shots at a problem and only keeping the successful result if one appeared, ignoring the failure rate?

andyferris 4 hours ago||

Good point. We need the mean, “any 1 of 10” and the “all 10 of 10” success rates in the metrics, so we can estimate reliability (the last one).
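As a sketch, the three numbers could be computed like this from a table of attempts. The function name and the input layout (one list of booleans per problem) are assumptions for illustration.

```python
def reliability_metrics(runs: list[list[bool]]) -> dict:
    """runs[i][j] is True if attempt j on problem i succeeded."""
    n = len(runs)
    total_attempts = sum(len(r) for r in runs)
    return {
        # Fraction of all individual attempts that succeeded.
        "mean": sum(sum(r) for r in runs) / total_attempts,
        # Fraction of problems solved at least once (what multi-shot scoring rewards).
        "any_of_k": sum(any(r) for r in runs) / n,
        # Fraction of problems solved on every attempt (the reliability number).
        "all_of_k": sum(all(r) for r in runs) / n,
    }
```

A large gap between `any_of_k` and `all_of_k` is the signal that a model can do the task but cannot be relied on to do it.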

krashidov 7 hours ago|||

> We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.

This is cool. Can you elaborate on it? Is it flaky? Does it take a long time?

cheshire_cat 9 hours ago|||

Wouldn't it be more efficient to convert the requirements in these 200 markdown files into Playwright tests?

You could still use an LLM to write and extend the tests, but running the tests would be deterministic and would use fewer resources.

tharkun__ 9 hours ago||

This type of thing so much.

AI is being pushed so much at work right now. For non-dev stuff even. The amount of things that people think are "awesome never seen this" is staggering.

Just because you haven't seen file format X converted to file format Y before and now you asked the LLM to do it and it worked, doesn't mean you needed an LLM for it nor that it's remarkable. The LLM knew how to do it because it learned from a bazillion online sources for deterministic converters that cost nothing (and are open source). But now you're paying, every single time, for a non-deterministic version of it and you find it cool. It's magic ...

But I guess they deserve it.

gofreddygo 8 hours ago|||

> It's magic

you'd be surprised how many people are comfortable attributing something they do not understand to magic.

more than anything, ai lets people who couldn't and wouldn't bother to learn to write simple code sidestep the ones who can and build solutions to scratch their own itch. faster, too.

now human behavior kicks in, and they don't want to hand control back to the people who can code to solve problems.

put this together and you have a good model to understand the AI sales pitch... it's magic

like all magic, it's but a trick.

dkersten 2 hours ago||

Oh, yes! As someone who has dabbled in card tricks, this so much. People don't understand how it's done and can't imagine or conceive of a way that it possibly could be done, so they attribute it to literal magic or demons or whatever. Like, no, I just distracted you for a split second and used sleight of hand.

Technology is no different: someone has never even considered that this thing could be possible, and now they see it with their own eyes? Incredible! They don't realise that it's mundane and has been possible (in much cheaper ways) for a long time. It was like a few years ago when some journalist posted an animation showing how Horizon Zero Dawn does frustum culling and all the non-tech people were all "wow! This game unloads the game world when it's not in view! Incredible!", like... yeah? That's how games have worked since the advent of 3D?

awongh 12 hours ago|||

The other part of the question is exactly when the "build for the capabilities of future models" becomes the present.

Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.

Is it a year away, or five? That's a big difference in deciding what to build today.

Joeri 14 hours ago|||

You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to “run everything”, it just needs to know how to launch the right script.

sharperguy 13 hours ago|||

So I wonder if a more powerful agent harness could have the agent basically write and execute its own deterministic code which, when executed, spawns subagents for each of the subtasks?

So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.

tonylucas 13 hours ago|||

I've been working on an integrated deterministic/agent system for a few months now. It basically runs an AI step to build a plan, which biases towards deterministic steps as much as possible but escalates back to AI when it needs to (for AI-only capabilities or deterministic failures). So effectively (when I perfect it; I'm about 90% there) it can bounce back and forth as needed, with deterministic steps launching AI steps and AI steps launching deterministic steps.

Probably not explaining it very well but I think it's pretty effective at reducing token usage.
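A guess at the simplest form of that escalation, as a sketch rather than the commenter's actual system. `run_with_escalation` and both handler signatures are hypothetical: the deterministic handler returns `None` when it cannot handle a task, and the model is only consulted then.

```python
from typing import Callable, Optional


def run_with_escalation(task: str,
                        deterministic: Callable[[str], Optional[str]],
                        ask_model: Callable[[str], str]) -> tuple[str, str]:
    """Try the deterministic handler first; fall back to the model only if it
    declines (returns None) or raises."""
    try:
        result = deterministic(task)
    except Exception:
        result = None
    if result is not None:
        return ("deterministic", result)
    return ("model", ask_model(task))
```

Returning which path produced the answer makes it easy to count how often tokens were actually spent.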

dkersten 2 hours ago|||

I've been building a workflow engine for agent orchestration and the workflows are just data for the engine to execute. While I haven't experimented with it yet, I envision that an LLM would be rather good at generating the workflows based on a description of your needs (and context about how best to utilise the workflow engine).

LLMs are pretty good at reasoning about workflows; it's just that when they have to apply them directly, the workflow context gets muddled with your actual task's context. That's why using an orchestration agent that delegates work to worker agents works so much better.

I still think there's a huge amount of value in having the workflow executed in a deterministic way (as code, or by a workflow engine) because it saves tokens, eliminates any possibility of not following it, and unlocks other cool things, like being able to give each step in the workflow its own focused task-specific context, splitting plans into individual actions and feeding them through a workflow one by one, and having workflow-step specific verification.

But that workflow absolutely CAN be created by an LLM, it just shouldn't be executed by one.
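To make the "workflows are just data" idea concrete, here is a small sketch. The workflow format, the step names, and `execute` are invented for illustration and are not the engine described above. Each step gets its own focused context plus the previous step's output, and each step is verified before the next one runs.

```python
from typing import Callable

# Hypothetical workflow, expressed as plain data that an LLM could generate.
WORKFLOW = [
    {"step": "plan", "context": "Break the task into actions."},
    {"step": "implement", "context": "Apply exactly one action."},
    {"step": "review", "context": "Check the change against the plan."},
]


def execute(workflow: list[dict], task: str,
            run_step: Callable[[str, str, str], str],
            verify: Callable[[str, str], bool]) -> dict:
    """Run steps in order; each sees only its own context and the previous output."""
    outputs, previous = {}, task
    for spec in workflow:
        result = run_step(spec["step"], spec["context"], previous)
        if not verify(spec["step"], result):
            raise RuntimeError(f"step {spec['step']!r} failed verification")
        outputs[spec["step"]] = result
        previous = result
    return outputs
```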

shripadt 12 hours ago|||

[flagged]

peyton 10 hours ago|||

I make codex do everything through a giant `justfile`. Simple, greppable, self-documenting, works great, and I don’t even need to read it.

sroussey 14 hours ago|||

I’m working on a hybrid system of old school task graph and ai agents and let them instantiate each other. I think others will do that eventually.

tonylucas 13 hours ago|||

I'm working on something similar (won't link to it as don't want people to think I'm spamming) but if you want to compare notes happy to talk.

cluckindan 14 hours ago|||

Jira for agents?

werrett 6 hours ago||

c.f. Linear for Agents

https://linear.app/agents

crsn 13 hours ago|||

Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscript

zapataband1 10 hours ago||

No you didn't

"What we're not open sourcing (yet) is the runtime. "

otikik 3 hours ago|||

I never tell claude to "go over this bunch of files and do this".

I tell it "write a program that goes over this bunch of files and does this".

Sometimes "do this" can be invoking another claude instance.

imtringued 3 hours ago|||

I'm personally surprised by this too. Like, everyone is writing how insanely productive AI is making them, but that productivity doesn't seem to have translated into any innovations beyond model quality.

Like, most of the stuff needed to make AI better is stuff that could have been written by hand in 2015, so why hasn't anyone used their agents to do so?

To be fair, there is probably a way to make it work the way you want. You could add an MCP for a task queue and let the model work each item in the task queue. The tasks could be added by a deterministic system i.e. your harness.

pishpash 14 hours ago|||

Can you not have it write your harness for you, or have it be the first step? You can push your own determinism where you need, surely.

svachalek 13 hours ago||

True. The prompt reads: Run the following Python: ```

zapataband1 10 hours ago||

[flagged]

BalinKing 8 hours ago||

From the site guidelines (https://news.ycombinator.com/newsguidelines.html):

> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

rnxrx 17 hours ago||

I wonder if a part of the problem isn't just the misapplication of LLMs in the first place. As has been mentioned elsewhere, perhaps the agent's prompt should be to write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible. This would hopefully include validation of the agent's output as well. The overall goal would be to keep the LLM out of doing processing that could be more efficiently (and often more correctly) handled programmatically.

chrismarlow9 16 hours ago||

100% agreed. use the non-deterministic thing that is right 90% of the time to generate a deterministic thing that is right 100% of the time. one of the key things I add to my prompts is:

- Please consult me when you encounter any ambiguous edge cases

Attaching the AI to production to directly do things with API calls is bad. For me the only use case where the app should do any AI stuff is with reading/categorizing/etc. Basically replacing the "R" in old CRUD apps. If you want to use that same new AI based "R" endpoint to auto fill forms for the "C", "U", and "D" based on a prompt that's cool, but it should never mutate anything for a customer before a human reviews it. Basically CRUD apps are still CRUD apps (and this will always be true), they just have the benefit of having a very intelligent "R" endpoint that can auto complete forms for customers (or your internal tooling/Jenkins pipelines/etc), or suggest (but never invoke) an action.

TZubiri 12 hours ago||

> Please consult me when you encounter any ambiguous edge cases

Why not check the logprobs of the output and take action when the probs of the first and second most likely tokens are too similar (or below a certain threshold)?
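A sketch of that check, assuming the API in use returns the log-probabilities of the top candidate tokens at each position. The function name and both threshold defaults are arbitrary placeholders, not recommended values.

```python
import math


def ambiguous_positions(top_logprobs: list[list[float]], min_margin: float = 0.2,
                        min_top_prob: float = 0.5) -> list[int]:
    """Return token positions where the model's top choice was a close call.

    top_logprobs[i] holds the log-probabilities of the candidate tokens at
    position i (any order, at least one entry).
    """
    flagged = []
    for i, candidates in enumerate(top_logprobs):
        probs = sorted((math.exp(lp) for lp in candidates), reverse=True)
        second = probs[1] if len(probs) > 1 else 0.0
        # Flag when the top two are too close, or the top one is weak on its own.
        if probs[0] - second < min_margin or probs[0] < min_top_prob:
            flagged.append(i)
    return flagged
```

Whether a close call at the token level corresponds to a genuinely ambiguous edge case in the task is a separate question this sketch does not answer.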

jatora 12 hours ago||

because this is manual? are you an llm?

vishvananda 15 hours ago|||

I think there is a flow in most organizations from:

llm -> prompt -> result

llm -> prompt + prompt encoded as skill -> result

llm -> prompt + deterministic code encoded as skill -> result

I do think prompting to generate code early can shortcut that path to deterministic code, but we're still essentially embedding deterministic code in a non-deterministic wrapper. There is a missing layer of determinism in many cases that actually makes long-horizon tasks successful. We need deterministic code outside the non-deterministic boundary via an agentic loop or framework. This puts us in a place where the non-deterministic decision making is sandwiched in between layers of determinism:

deterministic agentic flows -> non-deterministic decision making -> deterministic tools

This has been a very powerful pattern in my experiments and it gets even stronger when the agents are building their own determinism via tools like auto-researcher.
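A toy sketch of that sandwich. The tools and the `process` function are hypothetical stand-ins: the outer loop and the tools are ordinary code, and the model's only job is to pick a tool name, which is validated before anything runs.

```python
from typing import Callable

# Deterministic tools: plain functions with predictable behavior (toy examples).
TOOLS: dict[str, Callable[[str], str]] = {
    "uppercase": str.upper,
    "reverse": lambda s: s[::-1],
}


def process(items: list[str], decide: Callable[[str, list[str]], str]) -> list[str]:
    """Outer loop and tools are code; only the choice of tool is delegated."""
    results = []
    for item in items:                        # deterministic flow
        choice = decide(item, sorted(TOOLS))  # non-deterministic decision
        if choice not in TOOLS:               # never trust the decision blindly
            raise ValueError(f"model chose unknown tool {choice!r}")
        results.append(TOOLS[choice](item))   # deterministic tool
    return results
```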

VMG 16 hours ago|||

The problem is that often the program runs into some edge case that requires interpretation, at which point one is tempted to let the LLM deal with the edge case, at which point one is tempted to let the LLM deal with the whole loop and let it do the tool calls

Fishkins 16 hours ago||

Agreed. I think the approach described here is promising. Most of the workflow is deterministic and includes safeguards, but an LLM is invoked in the one case where it's really useful.

https://lethain.com/agents-as-scaffolding/

evilelectron 14 hours ago|||

This is exactly how I did my last project of automating the generation of an interface library between a server that controls hardware and the mobile app.

The hardware control team delivers a spec as a document and spreadsheet. The mobile team was using that to code the interface library and validating their code against the server. I converted the document to TSV, sent some parts to Claude, and had it write a parser for the TSV that keeps all the nuances of the human-written spec. It took more than 150 iterations to get the parser to handle all edge cases and generate an intermediate output as JSON. Then Claude helped me write a code generator using some custom glue on top of Apollo to generate the code that is consumed by the mobile app.

This whole pipeline runs as part of Github actions and calls Claude only when our library validator fails. There is an md file which is sent to Claude on failure as part of the request to figure out what went wrong, propose a solution and create a PR. This is followed by a human review, rework and merge. Total credits consumed to get here < $350.

memjay 6 hours ago|||

This has been our experience as well. Initially we had a list of tools that the agent could use to manipulate a data structure in certain ways. This approach was quite brittle. Now we are using a small DSL (domain specific language) and a single tool where the agent can input scripts written in the DSL. We are getting more dynamic use-cases now and wrong syntax can easily be caught by the parser and relayed to the agent.
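For illustration, a single-tool DSL of this kind might be as small as the sketch below. The two statements (`set`, `delete`) and the function `run_script` are invented here and are not the commenter's DSL. The returned error strings are what would be relayed back to the agent.

```python
def run_script(script: str, data: dict) -> list[str]:
    """Run a tiny made-up DSL against `data`; return error messages for the agent.

    Supported statements, one per line:  set <key> <value>  |  delete <key>
    The script is parsed fully before anything is applied, so a script with
    errors leaves `data` untouched.
    """
    ops, errors = [], []
    for lineno, line in enumerate(script.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "set" and len(parts) == 3:
            ops.append(("set", parts[1], parts[2]))
        elif parts[0] == "delete" and len(parts) == 2:
            ops.append(("delete", parts[1], None))
        else:
            errors.append(f"line {lineno}: cannot parse {line.strip()!r}")
    if errors:
        return errors
    for op, key, value in ops:
        if op == "set":
            data[key] = value
        else:
            data.pop(key, None)
    return []
```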

khasan222 14 hours ago|||

Completely agree! People tend to forget we are non deterministic too! Yet we are able to write code fine, and fairly reliably by using tools that can help keep us fairly honest.

I think most problems with ai tend to be around can you deterministically test the thing you are asking it to do?

How many of us would never ever show work, without going to check the thing we just built first?

cluckindan 14 hours ago||

> can you deterministically test the thing you are asking it to do?

Of course: have it write tests first; and run them to check its work.

Works well for refactoring, but greenfield implementations still rely on a spec that is guaranteed to be incomplete, overcomplete and wrong in many ways.

khasan222 12 hours ago|||

Well if the spec is incomplete it sounds like you should lower scope for the AI, and then go from there. I wouldn't be too keen to give a junior engineer free rein and expect awesomeness.

pishpash 14 hours ago|||

You can't ask something to check its own work without external reward/penalty. It'll cheat.

khasan222 9 hours ago||

Weirdly, and i fully think this is just some cognitive bias I don't have the knowledge to name, the ai seems very happy to please me. Like when it gets something done in one shot, it seems very happy to do so.

groovetandon 14 hours ago|||

This is so true. I have been working on a project built on exactly this principle:

https://www.decisional.com/blog/workflow-automation-should-b...

I think there is a fundamental incentive problem - code + llm + harness is bound to be more efficient but the labs want you to burn tokens so they are not going to tell you to use the code, just burn more tokens. They are asking us to forget about the token cost and reliability for now - model will become better.

This means that most people just believe that their agent should just be able to do anything with the help of some Model fairy dust with prompts + skills.

People need to watch their agents fail in production to be able to come to the right conclusion unfortunately.

user34283 59 minutes ago||

Skills are not fairy dust but a combination of prompts and deterministic code, so that you get the best of both worlds.

E.g. loop in the code and process the subagent's non-deterministic response for each individual task.

This takes 10 minutes to set up, you just need to run something like /skill-builder and describe the desired workflow.

I imagine many people just don’t know that it’s possible. I only discovered it a few days ago myself.

It worked on the first try.

marcus_holmes 9 hours ago|||

We have a rule that the LLM cannot perform any actions that result in actual money or stuff moving. Those can only be done by API calls that have lots of validation and checks on them, and adding or changing an API call is gated behind human review. The LLM is then free to make as many API calls as it likes, we're confident that it can't screw anything up too badly.
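A sketch of what one such gated call could look like. The refund example, the order table, and the limit are all hypothetical. The idea is that whatever the LLM asks for, the checks run in reviewed code before anything moves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int


# Hypothetical order book and limit; real checks would live behind the reviewed API.
ORDERS = {"A100": 2500, "A101": 900}
MAX_REFUND_CENTS = 5000


def validate_refund(req: RefundRequest) -> list[str]:
    """Return the reasons a refund request is rejected (empty list means accepted)."""
    problems = []
    paid = ORDERS.get(req.order_id)
    if paid is None:
        problems.append("unknown order")
    if req.amount_cents <= 0:
        problems.append("amount must be positive")
    elif req.amount_cents > MAX_REFUND_CENTS:
        problems.append("amount exceeds refund limit")
    elif paid is not None and req.amount_cents > paid:
        problems.append("amount exceeds what was paid")
    return problems
```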

foolserrandboy 16 hours ago|||

yup, the standard way of thinking about agents seems backwards and probably costly. Use LLMs to write scripts, then stick all your scripts in your own looping harness and call out for LLMs for those parts that are too hard to automate with some deterministic validation at the end.

nixpulvis 15 hours ago|||

My agents often write themselves scripts. Isn't that effectively what you're asking for? Prompting for scripts can also be a useful time and accuracy tactic when you know it'll be a good fit for it.

falkensmaize 13 hours ago|||

The problem is that code it spits out on the fly is untested and untrustworthy. Identify the parts of your workflow that could be accomplished with regular code - write and unit test that code, with LLM help if you want, and use the llm as the orchestrator only.

sisve 15 hours ago|||

Yeah, the problem is that I do not think the agents are good at reusing scripts and stitching them together. At least for me, they keep recreating too many similar ones. I hope we will see platforms like windmill.dev find the optimal solution for this. I have not been able to test it enough. But having a platform that gives you some observability out of the box and protects secrets from the LLM is nice.

reddit_clone 13 hours ago||

I noticed that too. Unless you _ask_ for a script, they throw away the scripts they write.

They are particularly bad at complex multiline parsing. Writing all sorts of weird/crude python/awk scripts and getting confused in the process.

I wish they would use Perl6/Grammar or Haskell/Parsec or similar and write better parsing scripts.

user3939382 14 hours ago||

> write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible

Correct. The concept of having probabilistic output with deterministic acceptance “guardrails” is illogical. If the domain resists deterministic modeling such that you’re using an LLM, the guardrails don’t magically gain that capability.

bwestergard 18 hours ago||

I agree with the sentiment, but I think the conclusion should be altered. When you hit the limit of prompting, you need to move from using LLMs at run time to accomplish a task to using LLMs to write software to accomplish the task. The role of LLMs at run time will generally shrink to helping users choose compliant inputs to a software system that embodies hard business rules.

scrappyjoe 17 hours ago||

I’ve had a couple of weeks of downtime at work, so I decided to incorporate agents into my work processes - things like note taking, task tracking, document management.

Your comment EXACTLY mirrors my experience. Week 1 was ever expanding prompts, and degrading performance. Week 2 has been all about actually defining the objects precisely (notes, tasks, projects, people etc) and defining methods for performing well defined operations against these objects. The agent surface has, as you rightly point out, shrunk to a translation layer that converts natural language to commands and args that pass the input validator.

sowbug 16 hours ago|||

A full-circle system prompt would be to "find every opportunity to put yourself out of your job by automating it away. When you are given a question that code can answer, answer the question by writing code and running it to obtain the result."

Such an LLM might have fared better with the strawberry test.
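Assuming the "strawberry test" here means the familiar question of how many times the letter r appears in "strawberry", the code path is about as short as it gets:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # prints 3
```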

Imanari 18 minutes ago||

That’s exactly the approach of smolagents. The only “tool” available is writing Python code.

edgarvaldes 17 hours ago|||

Some have expressed the opinion in this forum that the future of software lies in programs that are created and adapted at runtime, using genAI. I don't know how far we are from that.

aleksiy123 16 hours ago|||

It’s already here; the question is just to what extent.

Are google search results modifying your software at runtime?

Take an agent chat, for example: the output text is a UI, and agents can generate charts and even constrained UI elements.

Isn’t that created and adapted at run time?

If you mean like agents live modifying your code. I think that’s pretty much here as well. Can read the logs and send prs.

The only thing is how fast that loop will execute from days or hours to mins or seconds, and what validation gates it needs to pass.

My git repo is pretty much self modifying personal software at this point, that I interface through the ide chat window.

But I don’t think we will ever lose the intermediary deterministic language (code) between the llm and the execution engine.

It would be prohibitively expensive to run everything through models all the time.

But I am starting to think we need a more precise language than English when talking with LLMs. That can do both precision and ambiguity when you need either.

jmaw 16 hours ago|||

Some kind of "code", you could say

aleksiy123 13 hours ago||

Yes but more declarative vs imperative.

I say what, the LLM says how.

pishpash 13 hours ago||

Not that long ago the workflow was to turn code comments into code. Maybe leave some comments as is now.

pishpash 13 hours ago|||

Sounds like assemblers bemoaning loss of control to C. The solution was inline assembly...

mjr00 17 hours ago||||

> Some have expressed the opinion in this forum that the future of software lies in programs that are created and adapted at runtime, using genAI.

Good luck with that. Users will flood you with complaints if a button moves 5px to the left after a design update. A program that is generated at runtime, with not just a variable UI but also UX and workflows, would get you death threats.

hilariously 17 hours ago|||

I think many software adjacent folks are super excited because they can now have the personalized toothbrush they keep asking people to make for them.

The problem is that outside of that, most people want boring and regular interfaces so they can get in, solve the problem, and get out. They don't want to "love" it or care if it's "sexy"; they want it to work and get out of the way.

LLMs transmogrifying your software at every request assumes people are software architects and creators who love the computer interface, and that just doesn't describe the bulk of the population.

Most people using computers use them to consume things or utilize access to things, not for their own sake, and they certainly don't think "what if I just had code to do x..." unless x is make them a lot of money.

munk-a 14 hours ago|||

A program that is generated at runtime is fine (we have interpreted languages and often compile on demand) - the issue is with the non-deterministic nature of the output.

I think the core issue is that non-deterministic output is great for a chatbot experience where you want unpredictable randomness so it feels less like talking to the mirror - but when it comes to coding I think we're pretty fundamentally misaligned in sticking to that non-deterministic approach so firmly.

cassianoleal 15 hours ago|||

So we're back to vim over ssh in production, only without a human with _some semblance_ of judgement in the loop?

QuercusMax 15 hours ago|||

I've seen cases where models will get stuck in a particular mode of problem solving and need a nudge to tell them to move to a new mode. For example, instead of trying to massage a bunch of system service configs to handle hot-plug/unplug of an audio stream, what I really needed was to just write a couple dozen lines of Python to handle stuff.

I just had Claude write itself a couple shell scripts to handle a bunch of common cases (like running tests) in my workflow where it just couldn't figure it out efficiently. Now it just runs those tools and sets things up instead of spinning in circles for half an hour.

Every time it tries to ask me if it can run some one-off crazy shell or python one-liner to do something, I've started asking myself if I should have it write a tool I can auto-approve instead.

jerf 17 hours ago||

This is why I frequently refer to "next generation AIs" that aren't just LLMs. LLMs are pretty cool and I expect that even if we see no further foundational advancement in AIs that we're going to continue to see them exploited in more interesting ways and optimized better. Even if the models froze as they are today, there's a lot more value to be squeezed out of them as we figure out how to do that.

However, there are some things that I think need a foundational next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X" and can even after a lot of work end up seeing that as a bit of a "PLEASE DO X" seems fundamental to how they work. It can be easy to lose track of as we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.

There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that instead of having a "context window" has memory hierarchies something like we do, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.

I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.

cheesecakegood 12 hours ago||

Actual memory, in my opinion. Right now memory is broadly speaking like a system of sticky notes the AI writes itself and checks every time, rather than an integrative system that allows learning and can trigger more flexibly.

cultofmetatron 14 hours ago||

Here's a fun one for you: https://www.youtube.com/watch?v=kYkIdXwW2AE&t=315s

cloaky233 32 minutes ago||

It's not that agents don't need more prompts; actually, breaking the prompt into a combination of a dynamically changing prompt and a static prompt does resolve most of the issues. Control flow, on the other hand, is harnessing + context building, which is one major part of agentic workflows. So I believe an "optimized" combination of both is what we should be looking for.

throawayonthe 34 minutes ago||

i gave in and bought a month of claude (it really is a slot machine don't do it if you have an addictive personality lol) to vibecode a bit, and the Superpowers skill set is cool and all but it really seems like something that should be turned into a program

hmmmmmm maybe i could vibecode a harness based on that pi thing i've heard about, and integrate it closer with jj instead of relying on llms knowing how to use it, and make certain stages guaranteed to run... oh dear

edit: also i can't bring myself to believe the 'ultimate' form or whatever stabilizes out will be chat-based interfaces for coding and code generation

i think it's just that openai happened to strike gold with ChatGPT and nobody has time to figure anything else out because they've got to get the bazillion investor dollars with something that happens to kinda work

also afaiu all these instruct models are based on 'base' models that 'just' do text prediction, without replying with a chat format; will we see code generation models that output just code without the chat stuff?

gck1 13 hours ago||

As someone who went full circle prompt-enforcement > deterministic flow > prompt-enforcement, I disagree.

The reason why "DO NOT SKIP" fails is because your agent is responsible for too many things and there's things in context that are taking away the attention from this guidance.

But nobody said the agent that does enforcement must be the same agent that builds. While you can likely encode some smart decision making logic in your deterministic control flow, you either make it too rigid to work well, or you'll make it so complex that at that point, you might as well just use the agent, it will be cheaper to setup and maintain.

You essentially need 3 base agents:

- Supervisor that manages the loop and kicks right things into gear if things break down

- Orchestrator that delegates things to appropriate agents and enforces guardrails where appropriate

- Workers that execute units of work. These may take many shapes.

ex-aws-dude 12 hours ago||

Exactly, just keep adding more agents

SrslyJosh 11 hours ago||

I can't tell if this is satire or not. Well done!

baxtr 4 hours ago||

I think the key question is: How can you be sure the supervisor/orchestrator agents are reliable? You are just pushing the complexity down into another layer.

JohnMakin 16 hours ago||

> Imagine a programming language where statements are suggestions and functions return “Success” while hallucinating. Reasoning becomes impossible; reliability collapses as complexity grows.

This is essentially declarative programming. Most traditional programming is imperative, what most developers are used to - I give the exact set of instructions and expect them to be obeyed as I write them. Agents are way more declarative than imperative - you give them a result, they work on getting that result. Now the problem of course, is in something declarative like say, SQL, this result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.

Thinking about agents declaratively has helped me a lot more than trying to design these Rube Goldberg "control" systems around them. Didn't get it right? Ok, I validated it's not correct, let's try again or approach it differently.

If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.

Terr_ 10 hours ago||

> This is essentially declarative programming.

I think it's a step more abstract than that; we're doing... How about "narrative programming"? (Though we could debate whether "programming" is still an applicable word.)

Yes, it may look like declarative programming, but it's within an illusion: We aren't actually describing our goals "to" an AI that interprets them. Instead, there's a story-document where our human stand-in character has dialogue with a computer-character, and up in the real world we're hoping that the LLM will append more text in a way that makes a cohesive longer story with something useful that can be mined from it.

It's not just an academic distinction, if we know there's a story, that gives us a better model for understanding (and strategizing) the relationship between inputs and outputs. For example, it helps us understand risks like prompt-injection, and it provides guidance for the kinds of training data we do (or don't) want it trained on.

JohnMakin 10 hours ago||

I don't hate that distinction, I just think a lot of people are approaching this from an imperative framework that might not fit.

repelsteeltje 16 hours ago|||

I was thinking of declarative, but PROLOG rather than SQL. So with actual control flow and reasoning capabilities.

And then you run into similar issues as the llm does, like silent failures, loops, contradictions unless you're very careful.

The essence might be the same closed-world assumption problem. In the LLM case this manifests as hallucination rather than admitting it does not know.

PaulStatezny 11 hours ago|||

I agree. But you can speak imperatively to agents as well ("Here are specific steps; follow them") and they can still screw up. :) I think what you're looking for is determinism, not imperativism.

And to your point: instructing a (non-deterministic) LLM declaratively ("get me to this end state") compounds the likelihood of going off the rails.

JohnMakin 10 hours ago||

I don’t think I’m confusing the two, but it is an issue. See what I wrote in a sibling comment: Terraform is a great example of something that is declarative and also non-deterministic. You can’t control upstream API/provider changes, even between two plans happening simultaneously. That’s a lot like what working with agents feels like to me.

miltonlost 15 hours ago||

SQL's declarativeness is also based on the mathematics of relational algebra, so it will return the same result every time. Will it return it in the same amount of time every single query? No, that depends on indexing and database size. But the query itself won't be altered in the same way an LLM would be.

JohnMakin 15 hours ago||

Engines that use SQL can vary drastically in how they handle strings, floating points, etc., where identical SQL queries on identical data absolutely can return different results, which is why I mentioned the engine underneath - LLM's being nondeterministic in addition to declarative is kind of tangential to the point I was trying to make.

It is the same in terraform - yes, the HCL spec defines things very precisely, but you're kind of at the mercy of how the provider and provider API decide how to handle what you wrote, which can be very messy and inconsistent even when nothing changed on your side at all. LLM/agent usage feels a lot like that to me, in the sense it's declarative and can be a bit lossy. As a result there are things I could technically do in terraform but would never, because I need imperativeness.

My main point being, I think people are trying to ram agents into a ton of cases where they might not necessarily need or even want to be used, and stuff like this gets written. Maybe not, but I see it day to day - for instance, I have a really hard time convincing coworkers that are complaining about the reliability of MCP responses with their agents, that they could simply take an API key, have the agent write a script that uses it, and strictly bound/define the type of response format they want, rather than let the agent or server just guess - for some reason there is some inclination to "let the agent decide how to do everything."

I think that's probably what this article is getting at, but what I am saying is: rather than trying to create these elaborate control flows with validation checks everywhere to rein in an unruly application making dumb decisions, why not just use it to write deterministic automation instead of using the agent as the automation?

dkersten 2 hours ago||

This is something I realised late last year while using Claude Code. The LLM shouldn't be the one in control of the workflow, because the LLM can make mistakes, skip steps, hallucinate steps, etc. It's also wasteful of tokens.

I'm a firm believer that a "thin harness" is the wrong approach for this reason and that workflows should be enforced in code. Doing that allows you to make sure that the workflow is always followed and reduces tokens since the LLM no longer has to consider the workflow or read the workflow instructions. But it also allows more interesting things: you can split plans into steps and feed them through a workflow one by one (so the model no longer needs to have as strong multi-step following); you can give each workflow stage its own context or prompts; you can add workflow-stage-specific verification.
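A rough sketch of the idea in this paragraph, not the engine described here: the loop owns the order of steps, each step gets its own prompt, and each output passes a verification function before the next step starts. `call_model` and `verify` are placeholders supplied by the caller and are assumptions of this example; no real LLM client API is implied.

```python
from typing import Callable

# `ModelFn` stands in for whatever LLM client is used: prompt in, text out.
ModelFn = Callable[[str], str]
# `VerifyFn` receives (step, output) and returns True if the output is acceptable.
VerifyFn = Callable[[str, str], bool]


def run_plan(
    steps: list[str],
    call_model: ModelFn,
    verify: VerifyFn,
    max_attempts: int = 3,
) -> dict[str, str]:
    """Feed plan steps to the model one at a time; code, not the model, owns the order."""
    results: dict[str, str] = {}
    for step in steps:
        for _ in range(max_attempts):
            output = call_model(f"Do exactly this step and nothing else: {step}")
            if verify(step, output):
                results[step] = output
                break
        else:
            # Every attempt failed verification: stop instead of silently skipping.
            raise RuntimeError(
                f"step failed verification after {max_attempts} attempts: {step}"
            )
    return results
```

Because the model only ever sees one step, it cannot skip, reorder, or re-run steps; the worst it can do is fail a step, which the harness detects.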

Based on my experience with Claude Code and Kilo Code, I've been building a workflow engine for this exact purpose: it lets you define sequences, branches, and loops in a configuration file that it then steps through. I've opted to pass JSON data between stages and use the `jq` language for logic and data extraction. The engine itself is written in Rust (hand-coded; the recent Claude Code bugs taught me that the core has to be solid), while the actual LLM calls are done in a subprocess (currently I have my own TypeScript + Vercel AI SDK based harness, but the plan is to support third-party ones like Claude Code CLI, Codex CLI, etc. too, in order to be able to use their subscriptions).

I'm not quite ready to share it just yet, but I thought it was interesting to mention since it aims to solve the exact problem that OP is talking about.

user34283 40 minutes ago|

I’ve recently started to use skills and so far it’s been working great.

Your agent can write a python script to loop and simply call „claude -p“ or „codex exec“.

For simple workflows this seems good enough and can be set up in 10 minutes without third party software.
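Something along these lines, as a sketch: a plain Python loop that shells out to the agent CLI once per file and collects the output. The function name `run_each`, the prompt template, and the directory name in the usage comment are made up for illustration; only the `claude -p` / `codex exec` invocations come from the comment above.

```python
import subprocess
from pathlib import Path


def run_each(
    files: list[Path],
    base_cmd: list[str],
    prompt_template: str,
    timeout_s: int = 600,
) -> dict[str, str]:
    """Invoke the agent CLI once per file; Python owns the loop and the bookkeeping."""
    results: dict[str, str] = {}
    for path in sorted(files):
        prompt = prompt_template.format(path=path)
        proc = subprocess.run(
            base_cmd + [prompt],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        if proc.returncode == 0:
            results[str(path)] = proc.stdout.strip()
        else:
            results[str(path)] = f"ERROR: {proc.stderr.strip()}"
    return results


# Example usage (hypothetical directory and prompt):
#   files = list(Path("requirements").glob("*.md"))
#   report = run_each(files, ["claude", "-p"], "Review the file {path}.")
# Swap in ["codex", "exec"] as the base command to use Codex instead.
```

Every file is visited exactly once, in sorted order, and a failure in one file cannot cause earlier files to be re-run.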

What do you think?

isityettime 16 hours ago|

Afaict all harnesses are wrong in this respect, some of them deeply so.

Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.

Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".

There are a zillion LLM agents out there nowadays, but none of them really separate control from the agent loop from presentation well. (A few do at least have headless modes, which is cool.)
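To make the separation concrete, here is a toy sketch under simple assumptions: the "turn" is a worker thread that accumulates token counts, and a status query reads shared state under a lock without waiting for the turn to end. The `Session` class and its methods are invented for this example and do not describe any existing harness.

```python
import threading
import time


class Session:
    """Keeps usage stats readable from any thread while a turn is in flight."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tokens_used = 0
        self._turn_active = False

    def run_turn(self, chunks: list[int]) -> None:
        # Stand-in for a streaming model turn: each chunk adds tokens.
        with self._lock:
            self._turn_active = True
        for n in chunks:
            time.sleep(0.01)  # simulated generation latency
            with self._lock:
                self._tokens_used += n
        with self._lock:
            self._turn_active = False

    def status(self) -> dict:
        # Control-plane query: holds the lock only long enough to copy two fields.
        with self._lock:
            return {"tokens_used": self._tokens_used, "turn_active": self._turn_active}


session = Session()
worker = threading.Thread(target=session.run_turn, args=([100, 200, 300],))
worker.start()
print(session.status())  # answered while the turn may still be running
worker.join()
print(session.status())  # {'tokens_used': 600, 'turn_active': False}
```

The status call never blocks on the turn itself, only on a short critical section, which is the sense in which control is orthogonal to the chat loop.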

dnautics 16 hours ago||

> Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.

I get what you're trying to say but in practice architecting what you propose is considerably more difficult. Why not build it and try to get hired by one of the bigcos?

isityettime 15 hours ago|||

I don't think the basic architecture principles are novel. The big AI labs and other large tech companies already have engineers who can see this, without a doubt. But the AI labs clearly don't care if their LLM agents are just big balls of mud, and the big tech companies' priorities mostly lie elsewhere, too.

They just want features. They don't really care about duplicated work, so half of them reinvent the TUI rendering wheel. Pluggability is something that might be actually hostile to their interests in lock-in. And the AI labs probably think "after a couple more scaling cycles, our models will be so good that our agents can just rewrite themselves from scratch"; until they hit a compute or power wall, it always looks rational to them to defer rearchitecting.

Another real possibility is that if you work on an agent with a really clean architecture and publish it in hopes of getting hired by some AI company, all of them think "that looks great, but we don't want to rearchitect right now". Your code winds up in the training set, and a year and a half from now, existing agents can "one-shot" rewrites along the lines of your design because they're "smarter".

As for me, I'm not that interested, personally. There are other things I want to build and I'm working on those.

gf000 3 hours ago|||

In what way would it be more complicated? This is pretty basic concurrent programming, we routinely have much much more complex concurrent designs..

Hell, a telegram bot can handle that just fine.

the_duke 15 hours ago|||

In codex CLI /status works just fine during a turn.

Other things don't though.

user34283 15 hours ago||

I use the Codex desktop app.

In the GUI I can see the context indicator and usage stats.

It also makes it easier to jump between conversations and see the updates.

Sometimes I use Claude Code or opencode in the terminal, and my experience is much poorer compared to the Codex desktop app.
