Posted by salkahfi 4 days ago
- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
- Merge amendments up into the initial prompt.
- Evaluate prompts multiple times (ensemble).
This is very useful for things that take time to verify. We have CI stuff that takes 2-3 hours to run, and I hate when those runs fail because of a syntax error.
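For the two tips above (merge amendments into the initial prompt, evaluate multiple times), here is a minimal sketch of what that could look like. Everything in it is an assumption for illustration: `merge_amendments`, `ensemble_answer` and `fake_model` are made-up names, and `fake_model` just replays canned answers where a real model call would go.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, Tuple


def merge_amendments(initial: str, amendments: List[str]) -> str:
    """Fold follow-up corrections into one prompt instead of a long chat tail."""
    if not amendments:
        return initial
    bullet_list = "\n".join(f"- {a}" for a in amendments)
    return f"{initial}\n\nAdditional requirements:\n{bullet_list}"


def ensemble_answer(ask: Callable[[str], str], prompt: str, runs: int = 5) -> Tuple[str, float]:
    """Ask the same prompt several times; return the majority answer and its vote share."""
    answers = [ask(prompt).strip() for _ in range(runs)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / runs


# Hypothetical stand-in for a model call: replays canned answers in order.
_canned = cycle(["42", "42", "41", "42", "42"])


def fake_model(prompt: str) -> str:
    return next(_canned)


prompt = merge_amendments("Compute the answer.", ["Reply with a number only."])
winner, agreement = ensemble_answer(fake_model, prompt, runs=5)
print(winner, agreement)
```

A low vote share is a cheap signal that the prompt is in high-variance territory before you spend hours of CI time on the result.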
If future AI only manages to solve the variance problem, then it will have problems related to bias.
If future AI only manages to solve the bias problem, then it will have problems related to variance.
If problem X is solved, then the system that solved it won't have problem X. That's not very informative without some idea of how likely it is that X can or will be solved, and current AI is a better prior than "something will happen".
Exactly, the authors' argument would be much better qualified by addressing this assumption.
> current AI is a better prior than "something will happen".
“Current AI” is not a prior, it's a static observation.
Coherence requires 2 opposing forces to hold coherence in one dimension and at least 3 of them in higher dimensions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end to turn around.
So we had a better result from using Haiku (we fail over to Sonnet) over Opus and using a higher reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better as they do not double-think their approaches - they fail or they succeed, they are not as tenacious as the higher cost models.
We can escalate to higher authority and get out of that mess faster if we fail hard and early.
The knowledge of exactly how a failure happened seems to be less useful to the higher reasoning model than to the action-biased models.
Splitting up the tactical and strategic sides of the problem seems to work similarly to how generals don't hold guns in a war.
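A rough sketch of the plan-then-execute split described above, with fail-fast escalation. This is not the code from the paper; `planner`, `cheap` and `fallback` are hypothetical stubs standing in for a high-reasoning planner model, a cheap worker model and its failover.

```python
from typing import Callable, List, Optional

# A worker returns a result string, or None when it fails.
Worker = Callable[[str], Optional[str]]


def run_step(step: str, tiers: List[Worker]) -> str:
    """Try each tier once, cheapest first; raise so the caller can escalate."""
    for worker in tiers:
        result = worker(step)
        if result is not None:
            return result
    raise RuntimeError(f"escalate to planner: {step!r}")


def orchestrate(task: str, plan: Callable[[str], List[str]], tiers: List[Worker]) -> List[str]:
    """The planner decomposes once; workers execute each step without re-planning."""
    return [run_step(step, tiers) for step in plan(task)]


# Hypothetical stubs in place of real model calls.
def planner(task: str) -> List[str]:
    return [s.strip() for s in task.split(";") if s.strip()]


def cheap(step: str) -> Optional[str]:
    return None if "hard" in step else f"cheap:{step}"


def fallback(step: str) -> Optional[str]:
    return f"fallback:{step}"


results = orchestrate("rename var; hard migration", planner, [cheap, fallback])
print(results)
```

The point of the design is that a worker gets one attempt per tier and then hands the problem back up, rather than experimenting its way deeper into a dead end.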
This seems very basic to any kind of information processing beyond straight shot predictable transforms.
Expansion and reduction of possibilities, branches, scope, etc.
Biological and artificial neural networks converging into multiple signals, that are reduced by competition between them.
Scientific theorizing, followed by experimental testing.
Evolutionary genetic recombination and mutation, winnowed back by resource competition.
Generation, reduction, repeat.
In a continually coordinated sense too. Many of our systems work best by encouraging simultaneous cooperation and competition.
Control systems command signal proportional to demand, vs. continually reverse-acting error feedback.
Yes, this is not some sort of hard-fought wisdom.
It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.
In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.
If you don't really want the experiments and data from the academic paper, we have a white paper which is completely obvious to anyone who's read High Output Management, Mythical Man Month and Philosophy of Software Design recently.
Nothing in there is new, except the field it is applied to has no humans left.
By basic I didn't mean uninteresting.
In fact, despite the pervasiveness and obviousness of the control and efficiency benefits of push-pull, generating-reducing, cooperation-competition, etc., I don't think I have ever seen any kind of general treatment or characterization that pulled all these similar dynamics together. Or a hierarchy of such.
> In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.
I think it is the fact that the agents are operating coherently with the respective complementary goals. Whereas, asking one agent to both solve and judge creates conflicting constraints before a solution has begun.
Creative friction.
I am reminded of brainstorming sessions, where it is so important to note ideas, but not start judging them, since who knows what crazy ideas will fit or spark together. Later they can be selected down.
So we institutionalize this separation/staging with human teams too, even if it is just one of us (within our context limits, over two inference sessions :).
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
You can have a vanishingly small error and an incoherence at its max.
That would be evidence of perfect alignment (zero bias) and very low variance.
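To make that concrete, here is a small sketch of the standard bias/variance split of mean squared error over repeated runs. The `incoherence` ratio (share of error that comes from variance) is my assumption for illustration and may not match the paper's exact definition; the sample values are made up.

```python
from statistics import fmean, pvariance
from typing import Dict, List


def decompose(samples: List[float], target: float) -> Dict[str, float]:
    """Split mean squared error into bias^2 + variance over repeated runs."""
    mean = fmean(samples)
    bias_sq = (mean - target) ** 2
    variance = pvariance(samples, mu=mean)
    mse = fmean((s - target) ** 2 for s in samples)
    total = bias_sq + variance
    # Assumed illustrative ratio: fraction of the error due to variance.
    incoherence = variance / total if total > 0 else 0.0
    return {"bias_sq": bias_sq, "variance": variance, "mse": mse, "incoherence": incoherence}


# Tiny error, all of it from run-to-run scatter: zero bias, ratio at its max.
scattered = decompose([-0.125, 0.125], target=0.0)

# Consistently wrong by the same amount: all bias, no variance.
biased = decompose([1.0, 1.0], target=0.0)

print(scattered)
print(biased)
```

So a ratio near its maximum says nothing about how large the error is, only about where it comes from.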
Couldn't you have just said "know about a lot of different fields"? Was your comment sarcastic or do you actually talk like that?
The hallmark of intelligence in this scenario is not just being able to make the connections, but being able to pick the right ones.
Sometimes things that look very different actually are represented with similar vectors in latent space.
When that happens to us it "feels like" intuition: something you can't really put a finger on, and which might require work to put into a form that can be transferred to another human who has a different mental model.
Which is why, just occasionally, they're right, but mostly by accident.
Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.
Smaller prompts and fewer tools tend to be more stable. I try to stay within 1000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models. It's a very easy and comfortable trap to fall into.
The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend like it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub agent every time you read a file, you can reduce the impact to the token budget of the caller by orders of magnitude. The callee can return a brief summary or yes/no response to the caller after reading 500kb of source. This applies at each level of recursion and can compound dramatically (exponentially) over just a few nested calls.
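A single-level sketch of the idea, to show where the savings come from. `ask_subagent` is a hypothetical helper, and the substring check is only a stand-in for a real model call made in a fresh context; the file contents are synthetic.

```python
from typing import Callable, Dict, List


def ask_subagent(read_file: Callable[[str], str], path: str, question: str,
                 max_reply_chars: int = 200) -> str:
    """Read the file in a throwaway context and hand back only a short reply."""
    content = read_file(path)  # the large text never reaches the caller
    # Stand-in for a model call: a plain substring check.
    verdict = "yes" if question.lower() in content.lower() else "no"
    return f"{path}: {verdict}"[:max_reply_chars]


# Synthetic "source files", roughly 500 kB each.
files: Dict[str, str] = {
    "a.py": "x" * 500_000 + " def parse_config(): ...",
    "b.py": "y" * 500_000,
}

replies: List[str] = [ask_subagent(files.__getitem__, p, "parse_config") for p in files]

# What the caller's context actually has to hold, versus reading everything itself.
caller_chars = sum(len(r) for r in replies)
raw_chars = sum(len(t) for t in files.values())
print(replies, caller_chars, raw_chars)
```

Nesting the same call inside the sub agent gives the recursive version: each level only ever sees the short replies from the level below.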
However, I think producing a detailed enough specification requires the same or an even larger amount of work than writing the code. We write a rough specification and clarify it during the process of coding. I think there is a minimum effort required to produce these specifications, and AI will not help you speed up that effort.
The nice thing about code compared to other notation is that it's useful on its own. You describe an algorithm and the machine can then solve the problem ad infinitum. It's one step instead of the two steps of writing a spec and having an LLM translate it, then having to verify the output and alter it.
Assembly and high level languages are equivalent in terms of semantics. The latter help in managing complexity, by reducing harmful possibilities (managing memory, off-by-one errors) and presenting common patterns (iterators/collections, structs and other data structures, ...) so that categories of problems are easily solved. There's no higher level of computing model unlocked. Just a faster level of productivity, unlocked by following proven patterns.
Spec driven workflow is a mirage, because even the best specs will leave a lot of unspecified details. Which are crucial as most of programming is making the computer not do the various things it can do.
This is a very stimulating way of putting it!
My particular hypothesis on this is something that feels a little bit like Python and Ruby, but has an absolutely insane overkill type system to help guide the AI. I also threw in a little lispiness on my draft: https://github.com/jaggederest/locque/
Also, they rely surprisingly closely on "good" code patterns, like comments and naming conventions.
So if anything, a managed language [1] with a decent type system and not a lot of features would be the best, especially if it has a lot of code in its training data. So I would rather vote for Java, or something close.
[1] reasoning about lifetimes, even if aided by the compiler, is a global property, and LLMs are not particularly good at that
On the other hand: the usefulness of LLMs will always be gated by their interface to the human world. So even if their internal communication might be superseded at some point, their contact surface can only evolve if their partners/subjects/masters can interface with it.
I've had comical instances where asking an agent to "perform the refactor within somespec.md" results in it ... refactoring the spec as opposed to performing a refactor of the code mentioned in the spec. If I say "Implement the refactor within somespec.md" it's never misunderstood.
With LLMs _so_ strongly aligned on language and having deep semantic links, a hypothetical prompt compiler could ensure that your intent converts into the strongest weighted individual words to ensure maximal direction following and outcome.
Intent classification (task frame) -> Reference Binding (inputs v targets) -> high-leverage word selection .... -> Constraints(?) = <optimal prompt>
I recently used Claude for a refactor. I had an exact list of call sites, with positions etc. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the code position was for .result() or whatever.) I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near those I specified.
That is after iterating a few times on the prompt (first time it didn't want to do it because it was too much work, second time it tried to do it via regex, etc.)
Our team has started dedicating much more time to writing documentation for our SaaS app. No one seems to want to do it naturally, but there is very large potential in opening your system to machine automation. Not just for coding but for customer-facing tooling. I saw a preview of that possible future using NewRelic, where they have an AI chat use their existing SQL-like query language to build tables and charts from natural language queries right in the web app. Theirs kinda sucks, but there's so much potential there that it is very likely going to change how we build UIs and software interfaces.
Plus it also helps sales, support, and SEO having lots of documentation on how stuff works.
This is sort of a fundamental problem with all AI. If you tell a robot assistant to "make a cup of tea", how's it supposed to know that that implies "don't break the priceless vase in the kitchen" and "don't step on the cat's tail", et cetera. You're never going to align it well enough with "human values" to be safe. Even just defining in human-understandable terms what those values are is a deep existential question of philosophy, let alone specifying it for a machine that's capable of acting in the world independently.
I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.
Has anyone else found prompt density affects coherence?
Ran it on my sessions. Result: none of the skills scored STABLE. The structural predictors of high variance: numbered steps without a clear default, options without a (default) marker, content >4k chars (overthinking zone), and missing constraint language.
[1] https://github.com/anupamchugh/shadowbook (bd wobble)
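If you want to pre-screen skills for those structural predictors, a crude linter is easy to sketch. To be clear, this is not how `bd wobble` works; the heuristics, the `CONSTRAINT_WORDS` list and the `lint_skill` name are all my assumptions, and it only covers three of the four predictors (it does not try to detect numbered steps without a default).

```python
import re
from typing import List

# Assumed examples of "constraint language"; tune for your own skills.
CONSTRAINT_WORDS = ("must", "never", "only", "always", "do not")


def lint_skill(text: str) -> List[str]:
    """Flag structural patterns that were reported to predict high variance."""
    warnings: List[str] = []
    lowered = text.lower()

    # Lines that start with "Option ..." (optionally as a bullet).
    has_options = re.search(r"^\s*[-*]?\s*option\b", lowered, flags=re.MULTILINE)
    if has_options and "(default)" not in lowered:
        warnings.append("options without (default) marker")

    if len(text) > 4000:
        warnings.append("content over 4k chars")

    if not any(re.search(rf"\b{re.escape(w)}\b", lowered) for w in CONSTRAINT_WORDS):
        warnings.append("missing constraint language")

    return warnings


loose = "1. Pick a format\nOption A: JSON\nOption B: YAML\n"
tight = "1. Pick a format\nOption A: JSON (default)\nOption B: YAML\nAlways reply with one file only.\n"

print(lint_skill(loose))
print(lint_skill(tight))
```

It won't replace measuring variance across real sessions, but it is a cheap first pass over ~100 skills.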
The "mis-alignment" we do need to worry about is intentional. Naturally, the hyperscalers are deploying these models in order to benefit themselves. Ideally, customers will select models that are most grounded and accurate. In practice, there's a danger that people will select models that tell them what they want to hear, rather than what they should hear. We've seen this with journalism and social media.
The other danger is that absent a competitive marketplace for AI, a single corporation or a cartel will shape the narrative. The market valuations of some AI providers seem to be based on this assumption.
The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".
This should work as AIs become smarter, because intelligence implies becoming better Bayesians, which implies being great at calibrating the confidence intervals of their interpretations and their reasoning, and basically gaining a superhuman ability for evaluating the bounds of ambiguity and risk.
Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.
In practice, systematic misalignment (bias) is relatively easy to fix - you identify the pattern and add it to your prompt/context. "Always use our internal auth library" works reliably once specified.
Variance-dominated failures are a different beast. The same prompt, same context, same model can produce wildly different quality outputs on complex tasks. I've seen this most acutely when asking models to maintain consistency across multi-file changes.
The paper's finding that "larger models + harder problems = more variance" explains something I couldn't quite articulate before: why Sonnet sometimes outperforms Opus on specific workflows. The "smarter" model attempts more sophisticated solutions, but the solution space it's exploring has more local minima where it can get stuck.
One practical takeaway: decomposing complex tasks into smaller, well-specified subtasks doesn't just help with context limits - it fundamentally changes the bias/variance profile of each inference call. You're trading one high-variance call for multiple lower-variance calls, which tends to be more predictable even if it requires more orchestration overhead.