Anthropic apologizes for invisible Claude Fable guardrails

Posted by rarisma 12 hours ago

Anthropic apologizes for invisible Claude Fable guardrails (www.theverge.com)

https://web.archive.org/web/20260611122253/https://www.theve..., https://archive.ph/y4V4k

228 points | 262 comments | page 5

nrmitchi 4 hours ago

I just _know_ there is a (probably fairly large) group of people at Anthropic trying very hard to not say "I told you so" today

rodrigodlu 6 hours ago

The same week that they move the goalposts by blocking third-party harnesses on Claude Code. Nice.

I was a happy Max user.

aaroninsf 6 hours ago

ITT a surprising lack of perspective on the fact that despite the breathless pace of the singularity, people are still necessarily figuring things out as we go and we are well off the map.

Here there be monsters, and we don't have any real way of evaluating risk; and the leverage provided by tools already available affords systemic and even existential risk in a way no one—least of all an industry committed to shareholder value—has had to navigate, let alone with a million backseat drivers each with their own substack and brand to build.

mystraline 6 hours ago

Does "SORRY" fix the invisible garbage guardrails?

Does "SORRY" fix the deception these models use on the sly?

Does "SORRY" stop them from silently downgrading you to a shittier model without notification?

Does "SORRY" refund your tokens or money?

I'm guessing NO to all of those. Standard corporate sorry of "We're sorry you're offended and stupid and gullible".

BrenBarn 6 hours ago

This just means next time they'll make sure to keep it really secret.

system2 7 hours ago

Will Anthropic ever respond to these negative comments here? They won't.

reducesuffering 6 hours ago

They literally just have. The ethos is explained here. If you don't bother to read or grapple with it that isn't on them.

https://darioamodei.com/post/policy-on-the-ai-exponential

system2 5 hours ago

I said here, a human interacting with comments. You shared a blog post.

reducesuffering 3 hours ago

All of these negative comments are addressed by the blog post. What do you want them to say that isn't better answered by the details in their existing communications? No negative comment here was really novel.

system2 3 hours ago

The blog post is passive-aggressive and does not address the main points.

rvz 7 hours ago

Why would anyone defend Anthropic after this? Imagine falling for the DoW supply chain risk designation, and now this. This company is trying to ban powerful open models and restrict access to frontier models to slow everyone else down.

They just showed that they CAN do this right in front of you. Local open weight models are a necessity.

SilverElfin 7 hours ago

Invisible guardrails? Or purposeful sabotage if you use it for building AI capabilities?

But also, it isn’t the only huge mistake Anthropic has made in the last 48 hours. Having a sneaky data retention policy, while also giving companies no way to block Fable, is a massive problem. And it is ridiculous that Anthropic has so little respect for its customers. OpenAI should take advantage of this.

trunnell 5 hours ago

I'll defend Anthropic.

They are clear about the reasons for guardrails: prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.

What is the critique against that? It seems pretty reasonable to me. You want AI-accelerated biological or radiological experiments running in your neighbor's backyard? You want PRC-backed labs to continue to steal Anthropic's models via distillation?

Mitigating the harms of dual-use tech is notoriously difficult and fraught with trade-offs. What I would want to see is cautious rollout and quick response, which is EXACTLY what they're doing.

Instead, this thread is full of bad-faith arguments about Anthropic being dishonest, making a "useless" model, or "the power is going to their heads." You can't read Anthropic's System Cards and come away with any of these impressions. Quite the opposite, in fact. They are honest to a fault, acknowledging problems they discovered even when it hurts them.

If your harmless request was downgraded to Opus, you're billed for Opus. They were 100% clear about that. I'd much rather have a Mythos-class model that falls back to Opus 10% of the time than be capped to Opus 100% of the time. If that doesn't work for you, then make a suggestion for something better!

If you are a white-hat security engineer hitting guardrails, I don't think you have standing to complain. I really don't. Their Glasswing program actually got banks and the industrial sector to take action to fix security vulnerabilities. Do you realize how special that is? A huge portion of the economy runs on vulnerable code and has for decades, despite security experts testifying to Congress, begging business leaders, pleading for intervention, all with no results. But suddenly they're all enrolled in a program that will find *and fix* vulnerabilities! White-hat security people should be rejoicing. Instead some of them are throwing rocks. Unbelievable. Shameful.

Meanwhile, society is screaming at the AI labs to be more conscientious about potential harms of AI. Legislatures are passing laws limiting data center construction. There are protests. And you, the HN community, the vanguard of our profession, have the temerity to demand "NO GUARDRAILS!" "HOW DARE YOU TRY TO PROTECT DEMOCRACY!" "MY SOFTWARE PROJECT IS MORE IMPORTANT THAN KEEPING NUKES AWAY FROM THE BAD GUYS!"

Go ahead HN, downvote me. It'd be an honor.

zozbot234 4 hours ago

The original reporting of this from Anthropic didn't mention "authoritarian-backed AI labs" at all, only frontier ML research while leaving it entirely unspecified and unverifiable what was meant by "frontier". It's obviously reasonable that people would complain about that. And the notion that distillation-at-a-distance could be used to comprehensively "steal" a model, especially a frontier reasoning model that's likely relying on massive amounts of test-time compute, is completely unproven and quite ludicrous if you know anything at all about ML.

trunnell 4 hours ago

"Anthropic accused Chinese firms of 'industrial-scale distillation attacks' on its AI models."

"Distillation involves training less capable models on more advanced ones’ output, and can be used illicitly to acquire powerful capabilities cheaply. The AI startup accused China’s DeepSeek, MiniMax, and Moonshot of generating 'over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts,'"

https://www.semafor.com/article/02/24/2026/anthropic-accuses...

After reading their posts and watching interviews with Dario it's abundantly clear that they view Chinese-lab distillation of US frontier models as a threat to US national security. You can argue with them about whether that is true, but not whether distillation is real.

zozbot234 4 hours ago

It's definitely real, in the sense that it's a real violation of ToS. It could perhaps be used to guide a few narrow capabilities in very specific domains, given a model that's already most of the way there. But no, it's nowhere near the same as "stealing" a model outright, nor does it replace basic innovation in AI. And it's indistinguishable from practices that have long been common in the industry as a matter of fact, regardless of any ToS requirements.

trunnell 4 hours ago

Oh, I agree distillation isn't stealing "outright" as in it's not theft of 100% of the model. But there's a reason they're doing it. I didn't say anything about Chinese labs innovating -- obviously they are.

What accounts for the difference between your attitude that distillation is no big deal, a "common practice," and Anthropic's view of it as a huge threat?

zozbot234 4 hours ago

I never said that "it's no big deal". It's a clear-cut violation of ToS, and Anthropic are within their rights to care about that.

bellowsgulch 7 hours ago

Such a weird, openly immoral way to defend your moat, too.

Why not just tell people, "To defend our ability to be competitive in our industry, we ask that you do not use Claude or any of our models to independently perform research on large language models or any of their related architectures or technologies. In order to prevent this violation of the Terms of Service, we have trained Claude Fable to deny any requests or prompts which involve frontier AI research."
