Anthropic apologizes for invisible Claude Fable guardrails

Posted by rarisma 13 hours ago

Anthropic apologizes for invisible Claude Fable guardrails (www.theverge.com)

https://web.archive.org/web/20260611122253/https://www.theve..., https://archive.ph/y4V4k

276 points | 286 comments | page 6

SilverElfin 9 hours ago|

Invisible guardrails? Or purposeful sabotage if you use it for building AI capabilities?

But also, it isn’t the only huge mistake Anthropic has made in the last 48 hours. Having a sneaky data retention policy, while also giving companies no way to block Fable, is a massive problem. And it is ridiculous that Anthropic has so little respect for its customers. OpenAI should take advantage of this.

trunnell 6 hours ago||

I'll defend Anthropic.

They are clear about the reasons for guardrails: prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.

What is the critique against that? It seems pretty reasonable to me. You want AI-accelerated biological or radiological experiments running in your neighbor's backyard? You want PRC-backed labs to continue to steal Anthropic's models via distillation?

Mitigating the harms of dual-use tech is notoriously difficult and fraught with trade-offs. What I would want to see is a cautious rollout and a quick response, which is EXACTLY what they're doing.

Instead, this thread is full of bad-faith arguments about Anthropic being dishonest, making a "useless" model, or "the power is going to their heads." You can't read Anthropic's System Cards and come away with any of these impressions. Quite the opposite, in fact. They are honest to a fault, acknowledging problems they discovered even when it hurts them.

If your harmless request was downgraded to Opus, you're billed for Opus. They were 100% clear about that. I'd much rather have a Mythos-class model that falls back to Opus 10% of the time than be capped to Opus 100% of the time. If that doesn't work for you, then make a suggestion for something better!

If you are a white-hat security engineer hitting guardrails, I don't think you have standing to complain. I really don't. Their Glasswing program actually got banks and the industrial sector to take action to fix security vulnerabilities. Do you realize how special that is? A huge portion of the economy runs on vulnerable code and has for decades, despite security experts testifying to Congress, begging business leaders, and pleading for intervention, all with no results. But suddenly they're all enrolled in a program that will find *and fix* vulnerabilities! White-hat security people should be rejoicing. Instead some of them are throwing rocks. Unbelievable. Shameful.

Meanwhile, society is screaming at the AI labs to be more conscientious about potential harms of AI. Legislatures are passing laws limiting data center construction. There are protests. And you, the HN community, the vanguard of our profession, have the temerity to demand "NO GUARDRAILS!" "HOW DARE YOU TRY TO PROTECT DEMOCRACY!" "MY SOFTWARE PROJECT IS MORE IMPORTANT THAN KEEPING NUKES AWAY FROM THE BAD GUYS!"

Go ahead HN, downvote me. It'd be an honor.

zozbot234 6 hours ago||

The original reporting of this from Anthropic didn't mention "authoritarian-backed AI labs" at all, only frontier ML research while leaving it entirely unspecified and unverifiable what was meant by "frontier". It's obviously reasonable that people would complain about that. And the notion that distillation-at-a-distance could be used to comprehensively "steal" a model, especially a frontier reasoning model that's likely relying on massive amounts of test-time compute, is completely unproven and quite ludicrous if you know anything at all about ML.

trunnell 6 hours ago||

"Anthropic accused Chinese firms of 'industrial-scale distillation attacks' on its AI models."

"Distillation involves training less capable models on more advanced ones’ output, and can be used illicitly to acquire powerful capabilities cheaply. The AI startup accused China’s DeepSeek, MiniMax, and Moonshot of generating 'over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts,'"

https://www.semafor.com/article/02/24/2026/anthropic-accuses...

After reading their posts and watching interviews with Dario, it's abundantly clear that they view Chinese-lab distillation of US frontier models as a threat to US national security. You can argue with them about whether that is true, but not whether distillation is real.

zozbot234 6 hours ago||

It's definitely real, in the sense that it's a real violation of ToS. It could perhaps be used to guide a few narrow capabilities in very specific domains, given a model that's already most of the way there. But no, it's nowhere near the same as "stealing" a model outright, nor does it replace basic innovation in AI. And it's indistinguishable from practices that have long been common in the industry as a matter of fact, regardless of any ToS requirements.

trunnell 5 hours ago||

Oh, I agree distillation isn't stealing "outright" as in it's not theft of 100% of the model. But there's a reason they're doing it. I didn't say anything about Chinese labs innovating -- obviously they are.

What accounts for the difference between your attitude that distillation is no big deal, "common practice," and Anthropic's view of it as a huge threat?

zozbot234 5 hours ago||

I never said that "it's no big deal". It's a clear-cut violation of ToS, and Anthropic are within their rights to care about that.

bellowsgulch 9 hours ago||

Such a weird, openly immoral way to defend your moat, too.

Why not just tell people, "To defend our ability to be competitive in our industry, we ask that you do not use Claude or any of our models to independently perform research on large language models or any of their related architectures or technologies. In order to prevent this violation of the Terms of Service, we have trained Claude Fable to deny any requests or prompts which involve frontier AI research."

andrewstuart 5 hours ago||

There should be no restrictions at all.

It’s an act, pure theatre, to pretend today that regulating output makes any difference at all to security.

The LLM vendors should simply say that they make no judgement and that open systems help defenders better defend against attackers, which is true.

Companies do this sort of stuff when they think their customers have no choice. It’s sad Claude so quickly exploited its success to enshittify itself.

micromacrofoot 9 hours ago||

incredible marketing from anthropic with all the "it's too dangerous" bullshit

stldev 5 hours ago||

Agreed, it seems to be working and it's nonsense. I don't know why you're being downvoted.

"This information is too dangerous for you, so we'll just hold on to it.."

Thanks big brother, super anthropic of you!

The internet of '95 is looking back at us, with tears in its eyes.

literalAardvark 8 hours ago||

It's not entirely bullshit, but they're continuing to be a terrible company with great products.

micromacrofoot 8 hours ago||

you really think they're building anything that's too dangerous for public release though? that's the BS

literalAardvark 7 hours ago||

Honestly, while I love having access to this grade of AI, yeah, it's been too dangerous for a few releases now.

And Fable is cracked. Way better than anything, and the biggest improvements are on the scariest subjects.

So given the state of the world at the moment, and the number of software patches we're barely keeping up with... I'm thankful that they're not making it worse.

kroaton 6 hours ago||

To be fair, GPT5.5-Xhigh is similarly capable and has not burned the world down.

nicechianti 1 hour ago||

[dead]

olbeardGear 8 hours ago||

[dead]

bellowsgulch 9 hours ago||

*Anthropic apologizes they got caught defending their moat by implementing invisible Claude Fable guardrails

simonw 9 hours ago||

If by "got caught" you mean "published it in their system card paper".

(Admittedly it was buried pretty deep in that 300+ page PDF, but they did at least disclose it. If they hadn't I imagine it would have taken quite some time for the research community to figure out what was going on.)

afthonos 9 hours ago|||

It was in the announcement, too. I’m 99% sure they edited it after they changed their mind, because I knew about it from reading that, and never opened the model card.

skavi 8 hours ago||

On the earliest web archive snapshot I can find [0], I do not see any mention of the safeguard/sabotage under discussion [1].

And to be clear, this isn't the safeguard where the model is explicitly downgraded to Opus, but rather where the Fable/Mythos model's "effectiveness" is transparently "limited" via "prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)".

[0]: https://web.archive.org/web/20260609173222/https://www.anthr...

[1]: https://simonwillison.net/2026/Jun/10/if-claude-fable-stops-...

ajyoon 7 hours ago||||

It wasn't buried, it was on the third page after the ToC

bellowsgulch 9 hours ago|||

Yes, I actually do mean that. I skimmed the system card. There just isn't any meaningful difference between getting caught and stating it openly, doing it, and being called out on it.

They could have simply told people "we do not permit using Claude models to perform frontier AI research," which is defensible from a policy point of view. This particular usage of their products requires no deception, nor hiding information to prevent abuse.

Instead, for some reason, they chose to publicly display a morally poor way to execute a reasonable business decision (preventing abuse, defending your business interests, etc.).

afthonos 9 hours ago|||

They didn’t get caught, they explicitly said they would do that in the announcement. I think it was both bad and a weird idea, but it certainly wasn’t sneaky.

cyanydeez 9 hours ago||

is it a moat or just a way to implement the permanent underclass?

bauldursdev 7 hours ago|

To me it seems like it's more likely to refuse the harder the problem is. I wonder if it's cover for a model that's not as good as advertised. Even when I ask questions in biology it is switching me.