Anthropic apologizes for invisible Claude Fable guardrails

Posted by rarisma 9 hours ago

Anthropic apologizes for invisible Claude Fable guardrails (www.theverge.com)

https://web.archive.org/web/20260611122253/https://www.theve..., https://archive.ph/y4V4k

204 points | 227 comments | page 3

jmount 3 hours ago|

The whole arc was brilliantly evil. Once they put in the guardrails, Claude is fully unfalsifiable, and any failure can be claimed as intentional.

doubtfuluser 2 hours ago||

I’m wondering if their internal name is “Sophon” for this “feature”…

sometimelurker 4 hours ago||

I don't like this shift in the Overton window, or at least their perception of the Overton window. I really do like their open work on mech interp tho. least bad AI lab imo.

also if they do this or not is unprovable and other labs will probably silently implement this too. it'll be 100% normal by this time next year

kingcauchy 4 hours ago||

How much of the apology was written by Claude? How much of the release note process was written by Claude? Will they have better prompts going forward to make sure Claude doesn't write upsetting things into the release notes for devs like silent nerfing? Spooky times.

klmarks 4 hours ago||

The restrictions are there so that security researchers cannot disprove the Mythos claims:

"You see, Mythos can automatically break out of a VM running on SELinux, but unfortunately this is too dangerous and we had to implement guardrails for the Fable peasants."

mlazos 4 hours ago||

The idea of them purposefully wasting my time by having the model act dumber and me having to argue with it without knowing if it’s the prompt or the model was just such an idiotic product decision I can’t believe they shipped that without getting any feedback from users first.

whimsicalism 4 hours ago|

it's not a product decision, it's a safety decision. if you understood what they think they are building and the culture inside of anthropic you would understand why they did it.

michaelcampbell 4 hours ago|||

Safety from what? Competitors? That sounds like a product decision. They're puking on any requests that could be used to create LLMs or competitive products.

trunnell 1 hour ago|||

To prevent their models from doing harm in dual-use contexts including CBRN or by accelerating research in authoritarian-backed AI labs.

knollimar 39 minutes ago||||

Anything to prevent mecha ai hitler. At all costs

JTbane 4 hours ago|||

I would guess prevention of using Claude as a pentesting or hacking platform. This could mean that every script kiddie out there would be a massive risk.

Rapzid 4 hours ago||||

The road to hell is paved with "good" intentions.

efromvt 4 hours ago||||

I think you can sympathize with the safety motives while still thinking it was a dumb implementation to degrade silently? I actually have faith in them getting the guardrail triggers pretty good, but the consensus seems to be that they’re not there yet.

whimsicalism 4 hours ago||

I think it is clear, given the stakes, why you would not want to make your guardrails probe-able/invertible.

fooker 4 hours ago||||

> if you understood what they think they are building and the culture inside of anthropic you would understand why they did it.

This seems like a cult with extra steps.

Related: I interviewed for Anthropic a few months ago and in place of the usual HR call they have one where they have someone with a suspiciously relevant degree grill you about how committed you are to the 'mission'!

I probably came off as being skeptical, and then, hilariously, I was strongly encouraged to read the book published by the CEO to 'form accurate opinions' on AI safety.

largbae 4 hours ago||||

We do understand why they did it, and the reason is dark and cynical.

j-bos 4 hours ago||||

Don't buy it. It is actively deceiving the customer and charging them for the privilege of being lied to.

deadbabe 4 hours ago||||

They did it to make more money as you waste more time burning tokens with bad responses.

3fffa 3 hours ago|||

[flagged]

whimsicalism 3 hours ago|||

You are just completely wrong about what the driving motives are and an asshole to boot.

3fffa 3 hours ago||

[flagged]

km3r 1 hour ago|||

How does degrading responses to a cheaper tier jack up revenues?

hatthew 3 hours ago||

Part of the premise of the article is blatantly wrong. Distillation prevention was always visible. The only invisible safeguard was against frontier model development like development of training pipelines. This doesn't change the general idea that invisible degradation is bad and has been reverted, but the article changes the framing of the original issue from "preventing accelerating AI in the future" to "preventing cheaper AI right now".

decorner 3 hours ago|

New overlord, same as the old overlord.
