How does misalignment scale with model intelligence and task complexity?

Posted by salkahfi 4 days ago

How does misalignment scale with model intelligence and task complexity? (alignment.anthropic.com)

241 points | 79 comments

jmtulloss 4 days ago|

The comments so far seem focused on taking a cheap shot, but as somebody working on using AI to help people with hard, long-term tasks, it's a valuable piece of writing.

- It's short and to the point

- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term

- It's informative on how these models work, informed by some of the best in the business

- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")

kernc 4 days ago||

Other actionable insights are:

- Merge amendments up into the initial prompt.

- Evaluate prompts multiple times (ensemble).
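
A minimal sketch of the "evaluate multiple times" idea, assuming the answers are short strings that can be compared exactly. `ask_model` is a hypothetical stand-in for whatever client call you use; it is not a real library function.

```python
from collections import Counter
from typing import Callable


def ensemble_answer(
    ask_model: Callable[[str], str], prompt: str, runs: int = 5
) -> tuple[str, float]:
    """Send the same prompt `runs` times; return the most common answer
    and the fraction of runs that agreed with it."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # ask_model is a hypothetical callable: prompt in, answer text out.
    answers = [ask_model(prompt).strip() for _ in range(runs)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / runs
```

A low agreement fraction is a cheap signal that the prompt sits in a high-variance regime and may be worth splitting or tightening.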

sandos 3 days ago||

Sometimes, when stressed, I have used several models to verify each other's work. They usually find problems, too!

This is very useful for things that take time to verify: we have CI stuff that takes 2-3 hours to run, and I hate when it fails because of a syntax error.

xmcqdpt2 3 days ago||

Syntax errors should be caught by type checking / compiling / linting. That should not take 2-3 hours!

nth21 3 days ago||

There’s not a useful argument here. The article is using current AI to extrapolate future AI failure modes. If future AI models solve the ‘incoherence’ problem, that leaves bias as a primary source of failure (according to the author these are the only two possible failure modes apparently).

toroidal_hat 3 days ago||

That doesn't seem like a useful argument either.

If future AI only manages to solve the variance problem, then it will have problems related to bias.

If future AI only manages to solve the bias problem, then it will have problems related to variance.

If problem X is solved, then the system that solved it won't have problem X. That's not very informative without some idea of how likely it is that X can or will be solved, and current AI is a better prior than "something will happen".

nth22 3 days ago||

> That's not very informative without some idea of how likely it is that X can or will be solved

Exactly, the author's argument would be much better qualified by addressing this assumption.

> current AI is a better prior than "something will happen".

“Current AI” is not a prior, it's a static observation.

gopalv 4 days ago||

> Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.

Coherence requires 2 opposing forces to hold coherence in one dimension and at least 3 of them in higher dimensions of quality.

My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end to turn around.

So we had a better result from using Haiku (we fail over to Sonnet) over Opus and using a higher reasoning model to decompose tasks rather than perform each one of them.

Once a plan is made, the cheaper models do better as they do not double-think their approaches - they fail or they succeed, they are not as tenacious as the higher cost models.

We can escalate to higher authority and get out of that mess faster if we fail hard and early.

The knowledge of how exactly failure happened seems to be less useful to the higher reasoning model over the action biased models.

Splitting up the tactical and strategic sides of the problem seems to work similarly to how generals don't hold guns in a war.

[1] - https://arxiv.org/abs/2601.14351

Nevermark 4 days ago||

> Coherence requires 2 opposing forces

This seems very basic to any kind of information processing beyond straight shot predictable transforms.

Expansion and reduction of possibilities, branches, scope, etc.

Biological and artificial neural networks converging into multiple signals, that are reduced by competition between them.

Scientific theorizing, followed by experimental testing.

Evolutionary genetic recombination and mutation, winnowed back by resource competition.

Generation, reduction, repeat.

In a continually coordinated sense too. Many of our systems work best by encouraging simultaneous cooperation and competition.

Control systems command signal proportional to demand, vs. continually reverse-acting error feedback.

gopalv 4 days ago||

> This seems very basic

Yes, this is not some sort of hard-fought wisdom.

It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.

In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

If you don't really want the experiments and data from the academic paper, we have a white paper which is completely obvious to anyone who's read High Output Management, Mythical Man Month and Philosophy of Software Design recently.

Nothing in there is new, except the field it is applied to has no humans left.

Nevermark 3 days ago||

> Yes, this is not some sort of hard-fought wisdom.

By basic I didn't mean uninteresting.

In fact, despite the pervasiveness and obviousness of the control and efficiency benefits of push-pull, generating-reducing, cooperation-competition, etc., I don't think I have ever seen any kind of general treatment or characterization that pulled all these similar dynamics together. Or a hierarchy of such.

> In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

I think it is the fact that the agents are operating coherently with the respective complementary goals. Whereas, asking one agent to both solve and judge creates conflicting constraints before a solution has begun.

Creative friction.

I am reminded of brainstorming sessions, where it is so important to note ideas, but not start judging them, since who knows what crazy ideas will fit or spark together. Later they can be selected down.

So we institutionalize this separation/staging with human teams too, even if it is just one of us (within our context limits, over two inference sessions :).

maxkfranz 4 days ago||

More or less, delegation and peer review.

CuriouslyC 4 days ago||

This is a good line: "It found that smarter entities are subjectively judged to behave less coherently"

I think this is twofold:

1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.

2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.

energy123 4 days ago||

Incoherence is not error.

You can have a vanishingly small error and an incoherence at its max.

That would be evidence of perfect alignment (zero bias) and very low variance.
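
A small numeric sketch of that case. It assumes the usual decomposition error = bias² + variance, and assumes "incoherence" means the share of error that comes from variance; check the exact definition against the paper before relying on it. The function name and sample values are made up for illustration.

```python
from statistics import fmean, pvariance


def decompose(samples: list[float], target: float) -> dict[str, float]:
    """Split mean squared error around `target` into bias^2 and variance.

    Assumption: incoherence = variance / (bias^2 + variance).
    """
    mean = fmean(samples)
    bias_sq = (mean - target) ** 2
    variance = pvariance(samples, mu=mean)
    error = bias_sq + variance
    # With zero error there is nothing to attribute, so report 0.0.
    incoherence = variance / error if error > 0 else 0.0
    return {
        "bias_sq": bias_sq,
        "variance": variance,
        "error": error,
        "incoherence": incoherence,
    }


# Tiny scatter centred exactly on the target: small error, maximal incoherence.
tight = decompose([1.0 - 2**-10, 1.0 + 2**-10], target=1.0)

# Perfectly repeatable but wrong: large error, zero incoherence.
stubborn = decompose([2.0, 2.0], target=1.0)
```

Under these assumptions `tight` has an error of about one in a million yet an incoherence of 1, while `stubborn` has an error of 1 and an incoherence of 0, which is the sense in which incoherence is not error.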

booleandilemma 4 days ago|||

> the ability to traverse between domain valleys in the cognitive manifold.

Couldn't you have just said "know about a lot of different fields"? Was your comment sarcastic or do you actually talk like that?

reverius42 3 days ago||

I think they mean both "know about a lot of different fields" and also "be able to connect them together to draw inferences", the latter perhaps being tricky?

booleandilemma 3 days ago||

Maybe? They should speak more clearly regardless, so we don't have to speculate over it. The way you worded it is much more understandable.

pixl97 3 days ago||

There wasn't much room to speculate really, but it requires some knowledge of problem spaces, topology, and things like minima and maxima.

reverius42 3 days ago||

"inaccessible" rather than "ambiguous" -- but to the uninitiated they are hard to tell apart.

xanderlewis 4 days ago|||

What do 'domain valleys' and 'tunneling' mean in this context?

FuckButtons 4 days ago|||

So, the hidden mental model that the OP is expressing and failed to elucidate is that LLMs can be thought of as compressing related concepts into approximately orthogonal subspaces of the vector space that is upper bounded by the superposition of all of their weights. Since training has the effect of compressing knowledge into subspaces, a necessary corollary is that there are now regions within the vector space that contain nothing very much. Those are the valleys that need to be tunneled through, i.e. the model needs to activate disparate regions of its knowledge manifold simultaneously, which seems like it might be difficult to do. I'm not sure if this is a good way of looking at things though, because inference isn't topology and I'm not sure that abstract reasoning can be reduced to finding ways to connect concepts that have been learned in isolation.

esyir 4 days ago||||

Not the OP, but my interpretation here is that if you model the replies as some point in a vector space, assuming points from a given domain cluster close to each other, replies that span two domains need to "tunnel" between these two spaces.

esafak 4 days ago|||

A hallmark of intelligence is the ability to find connections between the seemingly disparate.

Earw0rm 3 days ago|||

That's also a hallmark of some mental/psychological illnesses (paranoid schizophrenia family) and use of certain drugs, particularly hallucinogens.

The hallmark of intelligence in this scenario is not just being able to make the connections, but being able to pick the right ones.

ithkuil 3 days ago||||

The word "seemingly" is doing a lot of work here.

Sometimes things that look very different actually are represented with similar vectors in latent space.

When that happens to us it "feels like" intuition; something you can't really put a finger on and might require work to put into a form that can be transferred to another human that has a different mental model

w10-1 3 days ago||||

Actually, a hallmark could be to prune illusory connections, right? That would decrease complexity rather than amplifying it.

esafak 3 days ago||

Yes, that also happens, for example when someone first said natural disasters are not triggered by offending gods. It is all about making explanations as simple as possible but no simpler.

TonyStr 4 days ago|||

Does this make conspiracy theorists highly intelligent?

gylterud 3 days ago||

No, but they emulate intelligence by making up connections between seemingly disparate things, where there are none.

Earw0rm 3 days ago||

They make connections but lack the critical thinking skills to weed out the bad/wrong ones.

Which is why, just occasionally, they're right, but mostly by accident.

p-e-w 4 days ago||

> When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.

Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.

CuriouslyC 4 days ago|||

This was not my experience studying philosophy. After Kant there was a period where philosophers were basically engaged in a centuries-long obfuscated writing competition. The pendulum didn't start to swing back until Nietzsche. It reminded me of legal jargon but more pretentious and less concrete.

root_axis 4 days ago||

It seems to me that your anecdote exemplifies their point.

skydhash 4 days ago|||

The issue is the revelation. It's always individual at some level. And don't forget our senses are crude. The best way is to store "insights" as information until we collect enough data that we can test them again (hopefully without a lot of bias). But that can be more than a lifetime's work, so sometimes you have to take some insights at face value based on heuristics (parents, teachers, elders, authority, ...).

bob1029 3 days ago||

You simply can't have a single shot context with so many simultaneous constraints and expect to make forward progress. This cannot be solved with additional silicon, power or data.

Smaller prompts and fewer tools tend to be more stable. I try to stay within 1000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models. It's a very easy and comfortable trap to fall into.

The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend like it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub agent every time you read a file, you can reduce the impact to the token budget of the caller by orders of magnitude. The callee can return a brief summary or yes/no response to the caller after reading 500kb of source. This applies at each level of recursion and can compound dramatically (exponentially) over just a few nested calls.
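
A toy sketch of the sub-agent pattern described above, assuming character counts as a rough proxy for tokens. `truncate` is a stand-in for a real model call that would return a summary; `Task`, `Result` and `run` are hypothetical names, not any agent framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    name: str
    content: str = ""  # raw material only this level reads, e.g. a file's text
    subtasks: list["Task"] = field(default_factory=list)


@dataclass
class Result:
    summary: str       # the only thing handed back to the caller
    own_context: int   # characters this call had to hold
    peak_context: int  # largest context held by any call in this subtree


def truncate(text: str, limit: int) -> str:
    """Stand-in for a model summary: keep the first `limit` characters."""
    return text[:limit]


def run(
    task: Task, limit: int = 80, summarize: Callable[[str, int], str] = truncate
) -> Result:
    """Delegate each subtask, then work only from the short summaries."""
    results = [run(sub, limit, summarize) for sub in task.subtasks]
    context = task.content + "".join(r.summary for r in results)
    peak = max([len(context)] + [r.peak_context for r in results])
    return Result(summarize(context, limit), len(context), peak)


# Three 500 kB "files", each read by its own sub-agent.
files = [Task(f"file{i}", "x" * 500_000) for i in range(3)]
root_result = run(Task("root", subtasks=files))
```

In this toy run each sub-agent holds 500,000 characters, while the root only ever sees 240 (three 80-character summaries). The cost is that anything the summary drops is invisible to the caller.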

smy20011 4 days ago||

I think it's not because the AI is working on "misaligned" goals. The user never specifies the goal clearly enough for the AI system to work.

However, I think producing a detailed enough specification requires the same or an even larger amount of work than writing code. We write a rough specification and clarify it during the process of coding. I think there is a minimum effort required to produce these specifications, and AI will not help you speed that up.

crabmusket 4 days ago||

That makes me wonder about the "higher and higher-level language" escalator. When you're writing in assembly, is it more work to write the code than the spec? And the reverse is true if you can code up your system in Ruby? If so, does that imply anything about the "spec driven" workflow people are using with AIs? Are we right on the cusp where writing natural language specs and writing high level code are comparably productive?

skydhash 4 days ago|||

Programming languages can be a thinking tool for a lot of tasks. Very much like a lot of notation, like sheet music and map drawing. A condensed and somewhat formal manner of describing ideas can increase communication speed. It may lack nuance, but in some cases, nuance is harmful.

The nice thing about code compared to other notation is that it's useful on its own. You describe an algorithm and the machine can then solve the problem ad infinitum. It's one step instead of the two steps of writing a spec and having an LLM translate it, then having to verify the output and alter it.

Assembly and high level languages are equivalent in terms of semantics. The latter helps in managing complexity, by reducing harmful possibilities (managing memory, off-by-one errors) and presenting common patterns (iterators/collections, structs and other data structures, ...) so that categories of problems are easily solved. There's no higher level of computing model unlocked. Just a faster level of productivity unlocked by following proven patterns.

Spec driven workflow is a mirage, because even the best specs will leave a lot of unspecified details, which are crucial because most of programming is making the computer not do the various things it can do.

crabmusket 4 days ago||

> most of programming is making the computer not do the various things it can do

This is a very stimulating way of putting it!

jaggederest 4 days ago||||

I believe that the issue right now is that we're using languages designed for human creation in an AI context. I think we probably want languages that are optimized for AI written but human read code, so the surface texture is a lot different.

My particular hypothesis on this is something that feels a little bit like python and ruby, but has an absolutely insane overkill type system to help guide the AI. I also threw in a little lispiness on my draft: https://github.com/jaggederest/locque/

gf000 3 days ago||

I don't know, LLMs thrive on human text, so I would wager that a language designed for humans would quite closely match an ideal one for LLMs. Probably the only difference is that LLMs are not "lazy", they better tolerate boilerplate, and lower complexity structures likely fit them better. (E.g. they can't really one-shot understand some imported custom operator that is not very common in their training data.)

Also, they rely surprisingly closely on "good" code patterns, like comments and naming conventions.

So if anything, a managed language [1] with a decent type system and not a lot of features would be the best, especially if it has a lot of code in the training data. So I would rather vote for Java, or something close.

[1] reasoning about lifetimes, even if aided by the compiler, is a global property, and LLMs are not particularly good at that

hnaccount_rng 3 days ago|||

But that is less fundamental than you make it sound. LLMs work well with human language because that’s all they are trained on. So what else _could_ an ideal language possibly look like?

On the other hand: the usefulness of LLMs will always be gated by their interface to the human world. So even if their internal communication might be superseded at some point, their contact surface can only evolve if their partners/subjects/masters can interface.

dudeinhawaii 3 days ago|||

When I think of the effect of a single word on Agent behavior - I wonder if a 'compiler' for the human prompt isn't something that would benefit the engineer.

I've had comical instances where asking an agent to "perform the refactor within somespec.md" results in it ... refactoring the spec as opposed to performing a refactor of the code mentioned in the spec. If I say "Implement the refactor within somespec.md" it's never misunderstood.

With LLMs _so_ strongly aligned on language and having deep semantic links, a hypothetical prompt compiler could ensure that your intent converts into the strongest weighted individual words to ensure maximal direction following and outcome.

Intent classification (task frame) -> Reference Binding (inputs v targets) -> high-leverage word selection .... -> Constraints(?) = <optimal prompt>

charcircuit 4 days ago|||

If you are on the same wavelength as someone you don't need to produce a full spec. You can trust that the other person has the same vision as you and will pick reasonable ways to implement things. This is one reason why personalized AI agents are important.

xmcqdpt2 3 days ago|||

As of today though, that doesn't work. Even straightforward tasks that are perfectly spec-ed can't be reliably done with agents, at least in my experience.

I recently used Claude for a refactor. I had an exact list of call sites, with positions etc. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the code position was for .result() or whatever.) I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near those I specified.

That is after iterating a few times on the prompt (first time it didn't want to do it because it was too much work, second time it tried to do it via regex, etc.)

dmix 4 days ago|||

> I think producing a detailed enough specification requires the same or an even larger amount of work than writing code

Our team has started dedicating much more time to writing documentation for our SaaS app. No one seems to want to do it naturally, but there is very large potential in opening your system to machine automation, not just for coding but for customer-facing tooling. I saw a preview of that possible future using NewRelic, where they have an AI chat use their existing SQL-like query language to build tables and charts from natural language queries right in the web app. Theirs kinda sucks, but there's so much potential there that it is very likely going to change how we build UIs and software interfaces.

Plus it also helps sales, support, and SEO having lots of documentation on how stuff works.

pixl97 3 days ago||

Detailed specification also helps root out conflicting design requirements and points at the desired behavior when bugs are actually found. It also helps when other stakeholders can read it and see misalignment with what their users/customers actually need.

hogehoge51 4 days ago|||

My thought too. To extend this: coding agents will make code cheap and specifications cheaper, but may also invert the relative opportunity cost of not writing a good spec.

cobblestone32 3 days ago||

> The user never specifies the goal clearly enough for the AI system to work.

This is sort of a fundamental problem with all AI. If you tell a robot assistant to "make a cup of tea", how's it supposed to know that that implies "don't break the priceless vase in the kitchen" and "don't step on the cat's tail", et cetera? You're never going to align it well enough with "human values" to be safe. Even just defining in human-understandable terms what those values are is a deep existential question of philosophy, let alone specifying it for a machine that's capable of acting in the world independently.

anupamchugh 4 days ago||

The "natural overthinking increases incoherence" finding matches my daily experience with Claude.

I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.

Has anyone else found prompt density affects coherence?

anupamchugh 3 days ago|

Following up - I built a tool "wobble"[1] to measure this: parses ~/.claude/projects/*.jsonl session transcripts, extracts skill invocations + actual commands executed, calculates Bias/Variance per the paper's formula.

Ran it on my sessions. Result: none of my skills scored STABLE. The structural predictors of high variance: numbered steps without a clear default, options without a (default) marker, content >4k chars (overthinking zone), and missing constraint language.

[1] https://github.com/anupamchugh/shadowbook (bd wobble)
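
For anyone who wants to try the structural checks without the tool, here is a rough sketch of three of them as a lint pass. This is not how wobble is implemented; the regexes, the constraint word list, and the 4000-character threshold are all guesses made for illustration, and the "numbered steps without a clear default" check is left out because it is not obvious how to detect it.

```python
import re

# Assumed word list; the original comment does not define "constraint language".
CONSTRAINT_PATTERN = re.compile(r"\b(must|never|always|only|do not)\b")
# Assumed convention: option lines start with "Option", optionally bulleted.
OPTION_PATTERN = re.compile(r"^\s*(?:[-*]\s+)?option\b", re.MULTILINE)


def structural_flags(skill_text: str, max_chars: int = 4000) -> list[str]:
    """Return heuristic warnings for a skill prompt (illustrative only)."""
    flags = []
    lowered = skill_text.lower()
    if len(skill_text) > max_chars:
        flags.append("content over size limit")
    if OPTION_PATTERN.search(lowered) and "(default)" not in lowered:
        flags.append("options without (default) marker")
    if not CONSTRAINT_PATTERN.search(lowered):
        flags.append("missing constraint language")
    return flags
```

A skill that lists "Option A" and "Option B" with no default and no "must"/"never" wording would come back with two flags; adding "(default)" to one option and a "must" sentence clears both.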

loudmax 3 days ago||

This paper indicates that we should probably be less fearful of Terminator-style accidental or emergent AI misalignment. At least, as far as the existing auto-regressive LLM architecture is concerned. We may want to revisit these concerns if and when other types of artificial general intelligence models are deployed.

The "mis-alignment" we do need to worry about is intentional. Naturally, the hyperscalers are deploying these models in order to benefit themselves. Ideally, customers will select models that are most grounded and accurate. In practice, there's a danger that people will select models that tell them what they want to hear, rather than what they should hear. We've seen this with journalism and social media.

The other danger is that absent a competitive marketplace for AI, a single corporation or a cartel will shape the narrative. The market valuations of some AI providers seem to be based on this assumption.

BenoitEssiambre 4 days ago||

This matches my intuition. Systematic misalignment seems like it could be prevented by somewhat simple rules like the Hippocratic Oath or Asimov's Laws of Robotics, or rather probabilistic Bayesian versions of these rules that take into account error bounds and risk.

The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".

This should work as AIs become smarter because intelligence implies becoming better Bayesians, which implies being great at calibrating the confidence intervals of their interpretations and their reasoning, and basically gaining a superhuman ability for evaluating the bounds of ambiguity and risk.

Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.

Soerensen 3 days ago||

The bias-variance framing here maps well to what I've observed building AI-assisted workflows.

In practice, systematic misalignment (bias) is relatively easy to fix - you identify the pattern and add it to your prompt/context. "Always use our internal auth library" works reliably once specified.

Variance-dominated failures are a different beast. The same prompt, same context, same model can produce wildly different quality outputs on complex tasks. I've seen this most acutely when asking models to maintain consistency across multi-file changes.

The paper's finding that "larger models + harder problems = more variance" explains something I couldn't quite articulate before: why Sonnet sometimes outperforms Opus on specific workflows. The "smarter" model attempts more sophisticated solutions, but the solution space it's exploring has more local minima where it can get stuck.

One practical takeaway: decomposing complex tasks into smaller, well-specified subtasks doesn't just help with context limits - it fundamentally changes the bias/variance profile of each inference call. You're trading one high-variance call for multiple lower-variance calls, which tends to be more predictable even if it requires more orchestration overhead.

leahtheelectron 4 days ago|

It's nice seeing this with Sohl-Dickstein as the last author after reading this blog post from him some time ago: https://sohl-dickstein.github.io/2023/03/09/coherence.html
