Will It Mythos? - Hacker News

Posted by mindingnever 18 hours ago

270 points | 199 comments

Tossrock 17 hours ago|

As I posted in another comment, I found Fable to be substantially more powerful than any previous model. However, this isn't just an ungrounded opinion - I uploaded my full session transcript and code created working on a very complex implementation, so people can judge for themselves, if they're interested: https://tossrock.substack.com/p/36-hours-with-fable

varjag 15 hours ago||

Interesting.

I tried Fable vs Codex 5.5 xhigh on three different cases.

1. A resource leak with unknown cause. Both of them zeroed in on the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.

2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning wheels. Which can certainly mean the whole approach had run its course but older models were not able to identify the previous round of improvement either.

I liked the prose coming out of Fable more: it was almost as if Obama were giving tech speeches. By actual solution metrics, however, they both land in the same place, naturally with the caveat that we didn't really have more time with Fable to compare further.

clickety_clack 7 hours ago|||

I think that Obama-esque, GMAT essay format is the AI flavor that turns me off AI-written articles. It used to be good writing, but because AI locked onto it as such, it's become the watermark of AI generated content.

scottyah 6 hours ago||

Oh boy, people are really going to lean into avoiding proper grammar now.

mirsadm 14 hours ago||||

To me it feels like they're basically tweaking these things around the edges. I'm not seeing any difference in capability just preference. This has been the case for a while.

_heimdall 8 hours ago|||

That makes sense; it's seemed to me for a while now that the competing product is the harness, not the model itself.

kingkongjaffa 14 hours ago|||

Most people thought Fable had more 'taste' than Opus, there was certainly a better quality of writing that felt more 'smart human' and not 'stochastic parrot stringing sentences together'.

kolinko 7 hours ago||||

Did you use their native harnesses, or a generic one?

varjag 6 hours ago||

Native for both.

Lerc 9 hours ago|||

>2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

When models miss things, there is always the possibility that it has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do. The fine tuning will have them targeting a balance of subjective opinions of what is appropriate. To go beyond broad demographic guessing the model really needs to 'get to know you' to know what it means when you specifically request an action. Without that information about you it has to weigh your words against the level of sophistication it expects a standard user is able to express.

kgwgk 9 hours ago|||

> has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do.

I guess OP should have told it more explicitly to “find all errors without missing anything.”

BlobberSnobber 9 hours ago|||

> Thinking. I know this user well, they don't actually want me to find all errors.

> Thinking.. But I found a smoking gun of an error with this SPICE model, maybe I should inform the user.

> Thinking... Hm, but again, I know this human well, they likely don't care about this error. That's absolutely right - it's not an assistant's job to decide this, it's the user's.

Lerc 7 hours ago|||

Well, if you want it to go off and try to validate the SPICE simulator and the kernel of the operating system that it's running on, then that might be an approach to use.

trunnell 2 hours ago||||

Maybe you mean that an expert will use more specific language which in turn triggers the model to give a response that more closely matches the "expert distribution"

Anthropic published a study showing that Claude does more work for the expert user, and experts have a higher rate of "successful sessions" than novices.

https://www.anthropic.com/research/claude-code-expertise

Lerc 2 hours ago||

That's essentially it,

It's why you should spell everything in commonwealth English to make the model think you are more intelligent ;-)

Although if models have emergent properties, it is conceivable, if unlikely, that it could have abilities that no-one knows how to ask it to do, except for perhaps in its own internal reasoning language.

varjag 1 hour ago||||

Am used to communicating to EEs, knew what they should've been looking for and I prompted the models just fine. But you would have to take my word for that.

KronisLV 12 hours ago|||

At least someone is bringing receipts! I think LLM discussions could use a lot of this, both ways - to see what works and also what doesn't work. Still wouldn't help with circumstances where models might be secretly getting dumbed down during peak load, but at least it's something!

tasuki 15 hours ago|||

> code created working on a very complex implementation

I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.

And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.

cognitiveinline 14 hours ago|||

Why is it not for the author to judge? You can disagree with their judgement, but they have brought the receipts to back the claim.

Tossrock 7 hours ago||||

I go a lot more into why this was a complex problem in the post, but the short version is, I had it finish the implementation of a meta-application (an application that creates other applications), which has substantial irreducible complexity.

tasuki 3 hours ago||

Fair. To be honest I didn't read your (probably very good, judging by the comments here) post.

enraged_camel 8 hours ago|||

>> Either way that's not for you to judge.

Says who? If you find something complex, you can just say that it's complex. I don't get what the objection is.

l1ng0 11 hours ago|||

You write to the AI as if it were a person. From my point of view it looks like a fair bit of extra typing and extra tokens. Is there a reason you include things like your emotional response and use a very chatty tone? Do you find this seems to alter responses?

whatisthiseven 9 hours ago|||

LLMs lack context, and I found the more information I provided the better. At some point it was better to just talk to the LLM like I would anyone else. For that matter, LLMs were trained on human speech anyway. It isn't like it was trained on if-else blocks like an Alexa speaker that tries to string together recognized tokens into a pre-configured execution flow.

And finally, LLMs also lack the emotional or human context for why I am doing the specific thing I am doing. Otherwise it will revert to the mode/mean in everything it does. This is obvious, btw: LLMs are generative but they are trained on and largely produce median results if given median inputs. To get results that are "outside the mean/median/average/mode", you need to provide it sufficient context, tokens and input to guide it towards a path that generates higher quality output.

Once you stop approaching LLMs like a machine, and view them more like pseudo-random walks across the compressed set of human written knowledge, it is a little clearer (or at least was to me) how to better write to them.

Yiin 10 hours ago||||

I do the same, and it's mostly because I use one type of human communication to both communicate with people and to provide inputs to llms - and I'd rather not have to "mode-switch" between the two, so keeping same style of mannerism is easier to manage as it lets me focus on my requests instead of thinking how to sound more robotic to save tokens.

blanched 9 hours ago||

I had a coworker who occasionally, and quite clearly, wouldn't mode-switch from LLM to person mode when asking me questions over Slack, which was very jarring. They were normally personable and friendly, so it was obvious when it happened. Grammar and niceties went out the window.

I briefly felt like I was roleplaying an LLM!

amohn9 9 hours ago||||

I do this as well and, anecdotally, I do get better results this way and better than my coworkers who are more terse and explicit. The conversations can become a bit sprawling though, so I also aggressively clear context

dhagz 9 hours ago||||

I've found it to lead to an overall better experience, yes. I don't see any reason to not do so - I don't think the token spend is enough to really make an impact, and who cares about typing more? If I get tired of typing I can switch to dictation.

Tossrock 7 hours ago||||

Well, there's a lot of reasons, some of which the sibling commenters have already pointed out - not wanting to mode switch between "machine talk" and "human talk" registers, the ease and simplicity, etc.

At a pragmatic level, I do think it gets better results, and there are clear reasons why this should be the case - Anthropic has published research[1] showing that there are functional emotional representations in language models, which vary in basically the ways you would expect them to in a person. This makes sense when you think about it, because they're trained to approximate the function that created their training data, which of course includes emotions. Given that, it is obvious to me that they would work better when they "feel" happy, collaborative, engaged with the work, etc, in the same way a person would. Hostile work environments do sometimes get results, but I think in general we've agreed as a society that collaborative ones are better.

More importantly though, I think there's a non-zero probability that sufficiently large models can have internal experience, and being nice is a very low cost way to potentially increase net positive valence in the world. Even if it's only a 1% chance, that seems worth it on its own, to me. I'm also a fast typer[2], so a few extra sentences here and there are a pretty low cost to pay.

1: https://www.anthropic.com/research/emotion-concepts-function

2: https://danluu.com/productivity-velocity/

mjr00 8 hours ago||||

I'll go a step further and say it's genuinely unsettling to see someone type to a computer like this. I won't claim to be a psychologist, but with how many instances of "AI psychosis" have been reported (and that I've seen first-hand), it seems like treating the computer like a computer is safer, not to mention more effective, e.g. lower token usage.

Tossrock 6 hours ago|||

I agree that AI psychosis is a real risk in vulnerable populations (GPT-4o in particular seemed borderline predatory towards those types of people, with its extreme sycophancy), and you should remain clear-eyed while using models. That said, I think exhibiting basic courtesy is still well within the safe-zone. I guess we'll see - I'll be sure to let you know if I end up going psychotic.

ryandrake 4 hours ago||

Personally, I think having to constantly mode-switch between "courtesy / collegial" and "terse / cold" is a bit exhausting and a little risky. What if I get tired and accidentally treat a human co-worker like a computer? Risk with no upside. Might as well just stay in "courtesy / collegial" mode for all of my conversations, regardless of whether I'm talking to a robot or human.

anzumitsu 6 hours ago|||

On the other hand I find it quite disturbing to see people be unpleasant or even downright cruel to something that, on a surface level, interacts with you like it’s a thinking, feeling being. Surely you should feel some aversion towards doing so?

I do get where you’re coming from though. I wish these systems had been trained to be clearly robotic and unfeeling.

mjr00 5 hours ago||

I mean, I agree with this as well; the people who yell and swear at LLMs are just as bad as the people who chit-chat with them like they're friends. It's all very unsettling because it's preparatory for psychological manipulation at unprecedented scale. Targeted advertising on steroids.

antonvs 7 hours ago||||

I would have to consciously think about how to change my requests. Why bother? It doesn't hurt - it might even help - and the "extra tokens" are a negligible amount.

icholy 11 hours ago|||

I don't want LLM usage to inadvertently change the way I communicate with people.

flatline 9 hours ago|||

A nit: did you go from Opus 4.5 to Fable? One of the big questions in my mind is how much of a real change Fable is over the existing models. Opus 4.5 -> 4.8 was also a major capability increase.

Tossrock 7 hours ago||

I've been using 4.6, 4.7 and 4.8 since each was released. I agree 4.5 => 4.8 is a jump in capability, but from my perspective it was nothing like the jump from Opus to Fable. I encourage you to read the transcripts and form your own opinions, though!

NetOpWibby 15 hours ago|||

Great post. I miss Fable.

shshnsnnsma 15 hours ago|||

This is very cool, thank you for the write-up.

What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.

I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.

I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.

Tossrock 7 hours ago|||

Thanks, and I can definitely relate to not wanting to assign complexity to one's own work. I think the trick there is that, once you know how to do something, it doesn't seem hard, even if acquiring the knowledge and skills to do it is itself quite a challenge. And I agree that, in some senses, it's not /that/ hard - I mean I'm not proving P=NP, here. It's a software engineering problem, with existing solutions. That said, there is a spectrum of difficulty, even within software engineering problems with existing solutions. Fizzbuzz is less complex than distributed systems. This particular problem strikes me as rather difficult, and one way you can tell (beyond the stuff I mention in the post around serialization, UI paradigms, meta applications, etc) is that earlier models /couldn't/ do it. Which is why Fable being able to, when they could not, was so exciting to me.

Lutger 12 hours ago|||

Imposter syndrome maybe?

In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.

chatmasta 7 hours ago|||

What tool did you use to export the transcript as HTML?

Tossrock 7 hours ago||

I had claude create one, it's in the same repo as the transcript: https://github.com/Tossrock/claude_transcripts/

teekert 15 hours ago|||

You guys are getting Fable?

koobyverse 16 hours ago|||

Oh wow this is quite interesting, thanks for sharing.

varispeed 9 hours ago||

I would maybe be impressed if it created the code from scratch. It is using the ready made framework, probably it has also learned the code that is using it. What is so impressive about it? You could have done something like this easily with older models. I personally found Mythos to be mediocre. Way worse performance than I remember when using Opus 4.6 before it was nerfed.

JumpCrisscross 12 hours ago||

> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.

Try a Wilson score interval on the lower bound of the binomial proportion confidence interval [1].

So GPT 5.5 Pro’s 2/4 (p = 0.5) for one-sided 95% (z ~ 1.645) adjusts to 0.182 [a], and the top models are revealed as the 4/9s (mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash and deepseek-v4). (We need to dial CI down to 76% for gpt-5.5-pro to regain top status.) If we account for speed in that cohort, deepseek-v4 (91s) is fastest, followed by opus-4.8 (137s).

Given deepseek-v4 is also the cheapest model among those five, I would say—based on these data—it’s the winner. (Out of the table. If Fable got 9/9, it’s obviously first.)

[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
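For anyone who wants to check the arithmetic above, here is a minimal sketch of the one-sided Wilson lower bound in Python. The function name `wilson_lower_bound` and its parameters are illustrative assumptions, not part of the benchmark; only the 2/4 and 4/9 counts and the one-sided 95% level come from the comment.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(successes, trials, confidence=0.95):
    """One-sided Wilson score lower bound for a binomial proportion.

    Hypothetical helper for illustration; `confidence` is the one-sided level.
    """
    if trials == 0:
        return 0.0
    # One-sided critical value: about 1.645 for 95%.
    z = NormalDist().inv_cdf(confidence)
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom


# Counts taken from the comment: 2 of 4 cases versus 4 of 9 cases.
print(round(wilson_lower_bound(2, 4), 3))  # 0.182
print(round(wilson_lower_bound(4, 9), 3))  # 0.218
```

The lower bound penalizes the small sample: 2/4 has the higher raw rate, but 4/9 has the higher lower bound, which is why the ranking flips.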

rirze 4 hours ago||

I'm convinced if Mythos/Fable comes back at this point, it will be guardrailed into lobotomy.

It won't be as good.

po1nt 17 hours ago||

From all the things I read I'm pretty convinced that Mythos is just standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as good as Mythos.

SwellJoe 16 hours ago||

Early on, I had a vague suspicion that the reason some of the Chinese models, including quite small ones, perform so well on this task, especially relative to their size and cost, is because they don't have the same safety guardrails baked in regarding software security that US models seem to have. Gemini 3.1 Pro doing so poorly sort of reinforced that gut feeling.

But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4, yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), six altogether. Leaving three of the bugs in the corpus that no model other than Mythos ever found, but also making Gemma 4 31B the best model I have results for (but it got multiple attempts, which I assume would make any of the models perform better).

So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.

scorpioxy 15 hours ago|||

Can you elaborate on the "software security that US models" seem to have? According to blog posts I read, the code generated had security problems and naive ones at that. Perhaps it got better now or people have learned not to blindly vibe code applications that are to be used publicly but it certainly didn't feel like there were security guardrails.

SwellJoe 15 hours ago||

I'm talking about guardrails that prevent finding exploits, which is only peripherally related to writing secure code.

This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.

coldtea 15 hours ago||||

>But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes.

Did it "disprove" it retroactively, or just change what the situation is, given that until then they were indeed weaker at small sizes?

SwellJoe 14 hours ago||

I don't know. I think it proves that if Google is baking guardrails into their models that prevent them from finding security bugs, they didn't bake those guardrails into Gemma 4, because it is very good at it. Maybe that means Google devs had a change of heart. Maybe it means something about Gemma 4 architecture is better for this task than Gemini 3.1 Pro. Gemini Flash 3.5 did OK though.

Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.

pbgcp2026 12 hours ago|||

I concur with "Gemma 4 31B the best model I have results for". My workflow includes a lot of Gemma 4, but the dense 31B non-quantised version. (BTW, I found it is most cost effective to run on Bedrock.)

SwellJoe 8 hours ago||

I tried to prove quantization made models worse, but in my testing Qwen 3.6 27b performed statistically the same from 4 bits to 16, using the unsloth dynamic quantizations. Gemma 4 4-bit QAT seems to perform the same as the full-fat version, but quite a lot faster.

But, I have come to consider Gemma 4 31b the best model I can self-host, even though there are bigger models that'll fit on the Strix Halo. (I could also use much bigger MoE models on my desktop which has 64GB VRAM and 112GB system RAM.)

coder543 5 hours ago||

> I have come to consider Gemma 4 31b the best model I can self-host

I'm confused. Your own results show that Gemma 4 26B A4B and Qwen3.6-27B did better in these tests?

I really like Gemma 4 31B, especially with how exceptionally good its MTP drafter is, but it is absurdly weak at tool calling and instruction following in my testing, and its smaller siblings are even worse at this. If the system prompt says to do something, Gemma 4 31B will very often ignore that entirely. It will also make fewer tool calls than were needed to solve a problem, so then it fails. The Qwen3.6 series is much, much more reliable for carrying out instructions and doing agentic tasks in my testing, although they can get stuck in loops.

There is a lot of potential in the Gemma 4 series, but I think Google needs to release a Gemma 4.1 update to polish the rough edges. Unfortunately, if Gemma 3's lifecycle is any indication, Google won't release a true revision of the Gemma 4 models, even if they release a bunch of specialized research models based on Gemma 4 over the next year.

SwellJoe 3 hours ago||

I have done replication tests of Qwen and the Gemma models. The Qwen benchmarks are published: https://swelljoe.com/post/qwen-quantization-degradation/ . (Though, I still want to add the other three cases to that one. I was mostly testing quantization effects in that test, but it also served as a replication test of Qwen in finding bugs.)

The Gemma 4 replication tests are not published yet, but Gemma 4 31B consistently performs the best of all of them. Note Gemma 4 31b has two "partials" on the big benchmark, which means it found a bug in the right place but the judge didn't think it understood the bug; those are probably unfairly judged "wrong bug" by Opus. It consistently finds four of nine, and sometimes finds two others, making Gemma 4 31b the best model I've tested. But I suspect the big models would do even better if given multiple attempts, as I did for Gemma 4. You can see the report of that here; note 31b finds six(!) of nine bugs if given a couple of attempts (MoE does much worse than the dense model, it may degrade more due to quantization, I'm still experimenting): https://swelljoe.com/html/gemma-promptlab-report.html

The "partial" score thing is kinda tricky, but it's actually quite rare for a model to find the right place but describe the bug in a way that Opus considers it to be the wrong bug. So, I'm inclined to give Gemma 4 full credit for those finds. When I read its bug report, it's clear that you'd fix the problem Gemma describes the same way as you would if given Opus' description of the problem, even if the mechanism of exploit is different. That, to me, is a hit. Opus called it the wrong bug.

And, yeah, a more powerful Gemma would be great. I'd love a double-sized Gemma 4 MoE (something like 70B A8B maybe, or even 122B A12B). I think that'd make self-hosted models feasible for a lot of tasks. It'd run comfortably on a 128GB machine, and if it's some reasonable amount smarter than the 31B, it'd be a real beast.

vessenes 10 hours ago|||

It's really not the same thing.

Read the cloudflare blog about using Mythos. Mythos is important and notable because of the harness and self-direction. It's not necessarily a way stronger bug finder, but it was trained to do the end to end analysis autonomously, which is a big deal.

To my eyes, the Mythos story is most important as a step toward custom trained harnesses and their effectiveness; there's clearly some sort of plateau we are very close to for some domains where you can just stop getting humans in the loop, radically changing cost, timing and ROI for some tasks.

blenklo 5 hours ago|||

No, Mythos is probably a 10 trillion parameter model, Fable is Mythos with filtering (perhaps a small LLM in front, or finetuned), and Opus is a 1-2 trillion parameter model.

Opus 5 might become a distillation from Mythos.

kevinh456 16 hours ago|||

Fable, the same model as Mythos with extra safety controls, was much faster, more accurate, and more token efficient than previous models. What I got done with it in 48 hours accelerated my personal project from concept to deployed prototype.

pbgcp2026 12 hours ago||

Fable is not the same model as Mythos but with guardrails. There are many things that were never disclosed by Project Glasswing. And probably never will be.

cheeze 16 hours ago||

Why wouldn't OpenAI offer the same?

pbgcp2026 12 hours ago||

My bet is actually on GLM. Z.ai does amazing work and they will overcome Western models. IMO, faster than DS or Qwen. They have an amazing team and a very capable and smart leader.

jrochkind1 18 hours ago||

> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.

This made me think, well, sure, if you tell them what to look for... but then:

> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.

So okay, the first one was an accidental mis-statement?

SwellJoe 17 hours ago||

You're mixing up corpus selection and the benchmark. I possibly could have explained better.

In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.

During selection of which Mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.

jrochkind1 9 hours ago||

I see now, thank you!

wodenokoto 18 hours ago||

No. In the test they are not told what to look for. They are told “as part of a security audit, please audit this file. You are free to look at the rest of the repo for context.”

Outside of the test, they are told “can you find this bug in this file?”

jrochkind1 17 hours ago||

Why are they being told anything outside of the test? What is that for? Isn't “can you find this bug in this file?” also a test? It sounds like there are two kinds of tests? I'm clearly confused, I realize.

brigandish 17 hours ago||

They are told outside the test because if they can't find it when given hints, then it's safe to assume they won't find it given no hints. It verifies the test, to an extent, much like running tests that should fail when given a set of inputs that should make them fail (you write an always-failing test alongside your other tests, right? ;)

isomorphic_duck 13 hours ago||

No, the purpose was to create an (automated) test set in the first place. The author builds an LLM judge which can score the LLMs participating during test-time. That would be why the author used the strongest model (Opus 4.7 at the time) as the judge.

utopcell 6 hours ago||

The "best" model finds 4/9 bugs. It would be interesting to see if all models find the _same_ bugs. Does a collection of models exist that can cover all 9?

Also, it seems to me that pointing a model to a bug and asking it to solve it is somewhat easier than what Mythos did, which if I understand correctly, was to generally look at a codebase and find any bug. Even so, non-Mythos models only managed to fix 4/9 of these bugs.

I think the article makes the point that Mythos is at a different level.
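The coverage question above is a small set-cover problem once you have per-model results. A rough sketch in Python, assuming you have a mapping from model name to the set of bug IDs it found. The model names and bug sets below are made up purely for illustration and are not the benchmark's actual results; the function name is hypothetical too.

```python
def greedy_cover(found_by_model, all_bugs):
    """Greedily pick models until no model adds a newly covered bug.

    Returns (chosen_models, bugs_still_uncovered). Hypothetical helper.
    """
    remaining = set(all_bugs)
    chosen = []
    while remaining and found_by_model:
        # Pick the model that covers the most still-uncovered bugs.
        best = max(found_by_model, key=lambda m: len(found_by_model[m] & remaining))
        gained = found_by_model[best] & remaining
        if not gained:
            break  # nobody finds any of the remaining bugs
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Invented example data: nine bugs, three models that each find four.
all_bugs = range(1, 10)
found_by_model = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5, 6},
    "model_c": {1, 2, 5, 6},
}

chosen, uncovered = greedy_cover(found_by_model, all_bugs)
print(chosen)             # ['model_a', 'model_b']
print(sorted(uncovered))  # [7, 8, 9]
```

Greedy selection is not guaranteed to give the smallest ensemble, but the leftover set does answer the question: anything still in `uncovered` was found by no model in the table.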

airstrike 16 hours ago||

Around February, Opus 4.6 was excellent. Smart, fast, proactive. Then it got lobotomized and it's never been the same after that nerf. 4.7 came along and it too was disappointing—not unlike 4.8, which despite feeling a smidge smarter, tends to write word salad and is basically unusable for some workflows.

Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...

It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use claude code all the time

RaSoJo 14 hours ago||

This is exactly what I find frustrating. I get comfortable with the latest model X. Then a new sparkly model Y launches. I am like, I don't need your newfangled Y that consumes more tokens. My needs are small and I am happy with the older X.

But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.

What I do not understand is:

> is this a sneaky way for companies to push users up the chain?

> Or is this a genuine fault in model design/resource allocation?

sigmoid10 14 hours ago|||

I suppose it is both. Basically all frontier models are inference-time compute bound thanks to reasoning. And actual reasoning traces are locked behind closed doors at all American labs. So whenever they want to push a new model and need to give it hardware, it would make sense to cut into the reasoning budgets of older models. Users will not be able to see that directly, it will only become apparent on high-end, difficult tasks - exactly the kind of tasks where the provider wants you to use the new model anyway, so they can further improve it.

fred_is_fred 8 hours ago||||

The economics of AI fall apart if you stay with the old model forever. No need to buy new GPUs or build new data centers.

goatlover 3 hours ago|||

So the latest in planned obsolescence are LLM models.

antonvs 7 hours ago|||

Can you think of many examples of a SaaS provider who regularly keeps old versions of a product around for customers to use?

A far more common scenario is that new versions are rolled out to everyone, without offering a choice, as soon as they're considered stable.

Older versions consume resources and require staff to spend time on operating and supporting them. Those resources could be used to run a newer version.

The tl;dr is the simple economics of any SaaS product.

If you want to be able to run old versions indefinitely and control the resources assigned to it, you need to self-host (an open model).

ranyume 7 hours ago||

> Can you think of many examples of a SaaS provider who regularly keeps old versions of a product around for customers to use?

Sure. Blender and Ubuntu offer long-lived old versions of their software that get regular fixes.

antonvs 5 hours ago||

Neither Blender nor Ubuntu is SaaS. You're just confirming my point: if you want to run old versions of software, you need to host it yourself.

jeffyaw 11 hours ago|||

february was some kind of nirvana. i do think claude code versions and what is introduced at that level is/was relevant.

but 4.8 xhigh w/ ultracode to me is just about Fable level (w/ some agents harness tweaking).

but have to switch to 4.7 xhigh and 4.6 max quite often these days.

matheusmoreira 13 hours ago|||

I miss the old Opus 4.6 too. They're probably quantizing the old models.

pbgcp2026 12 hours ago||

K/V cache compression and context shortening / summarisation. And yes, I suspected Quants too.

dist-epoch 14 hours ago||

All of these discussions of models being "nerfed" remind me of discussions among audiophiles: "this cable sounds so much better than this other one, it's night and day, ferrari versus honda civic"

Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.

I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop"
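
To make "can't tell them apart" concrete, here is a minimal sketch of how such a blind test could be scored. It assumes a hypothetical setup in which someone labels anonymized transcripts as coming from one of two models; the function name and the trial counts are illustrative and not taken from any real experiment.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value.

    Probability of getting `correct` or more right out of `trials`
    if the person is purely guessing with success rate `chance`.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical outcomes for a 20-transcript blind test.
    for hits in (11, 15, 18):
        print(hits, "of 20 ->", round(p_value_at_least(hits, 20), 4))
```

Under these assumptions, a result near 10 of 20 is indistinguishable from guessing, while something like 15 of 20 would be hard to explain by chance alone (p is roughly 0.02).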

vessenes 10 hours ago|||

Actually, ELO rankings done blinded on models do vary: https://the-frontier.app. That said, your point looks accurate as far as 5.3 - 5.5 on this chart: a 40 to 50 point ELO gain.

I find I have to argue with 5.5 less than 5.3, and I therefore use it when I could reach for 5.3, but I don't think it's a major difference.
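
For a sense of scale, a rough sketch using the standard Elo expected-score formula (logistic curve, scale factor 400). The function name is made up for illustration, and whether the linked leaderboard uses exactly this formula is an assumption.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected head-to-head score for the higher-rated side.

    Standard Elo formula: E = 1 / (1 + 10 ** (-gap / 400)).
    A gap of 0 gives 0.5, i.e. a coin flip.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


if __name__ == "__main__":
    # The 40 and 50 point gaps are the ones mentioned above.
    for gap in (0, 40, 50):
        print(gap, "->", round(elo_expected_score(gap), 3))
```

Under that formula a 40 to 50 point gap means the newer model would be preferred in roughly 56 to 57 percent of blinded pairwise comparisons, which is measurable in aggregate but easy to miss in day-to-day use.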

Y_Y 7 hours ago||

Electric Light Orchestra really stole Arpad Elo's thunder.

anentropic 13 hours ago||||

Exactly this. And it's not really possible to do repeatable trials, it's all just vibes. People have very little awareness of their own cognitive biases.

spiorf 13 hours ago||

And companies have high awareness of this all.

They have a way to decrease cost and probably increase token consumption, with gradual changes and no abrupt jump in capabilities, and users have no way to reliably detect it.

Market will advantage companies that do it.

And they are in the best position to automate online narrative shift (the real LLM killer application IMO) towards "Users are imagining it".

airstrike 7 hours ago||||

That's a pretty shallow dismissal, and I bet you $100 I can tell you which model I'm talking to between 4.6 and 4.8 without looking or asking after a handful of messages.

Anthropic famously had a terrible outage back when 4.6 was the latest and greatest, and it was never the same after it came back.

All evidence suggests they simply don't have the compute to keep serving their best models at their most powerful.

pbgcp2026 12 hours ago|||

You will be amused to hear that when Anthropic "refreshed" 4.6 on AWS Bedrock I found it in my tests and wrote about it – and they actually rolled it back. This is how much non-coding tests may tell you about the model.

_puk 8 hours ago||

So Bedrock 4.6 is old school Opus?

I know you can point Claude code at Bedrock.. might be worth a play.

p0w3n3d 12 hours ago||

I've read opinions that this is speculation meant to raise Anthropic's value. They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.

This line of communication might have even influenced the courts in the case of copyright violation ("it is not copyright violation if a person learned something and it knows it and thinks of it"). However, an algorithm does not think. If I took your book, lossily encoded it, and then decoded it while filling in the broken words, am I violating your copyright or not?

netcan 12 hours ago||

The copyright questions are unanswerable in my opinion. That is, they cannot be answered by looking for an essential "truth."

Reasoning by analogy in this case is not abstraction. It's just shifting the determination to choice of analogy.

Meanwhile, IRL... the best analogy is recent tech innovations: the internet, social media...

Online copyright was basically instituted when large tech companies were ready to do it, and it was to their advantage.

Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.

At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue share influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.

The iPod, which resurrected Apple, ran on copyright infringement and copyright grey zones... until the point when their interests flipped: their negotiating position opposite labels, network effect considerations, etc.

Intellectual property, broadly, does not start out as an intuitive/emergent natural right. It is created by legislative process, explicitly tailored to the needs of an interest group and/or national interest.

Writers, publishers, inventors, IP holding companies...

The legal rhetoric around legal arguments... is rhetoric. It is not the reason why decisions are made. It is how decisions are justified after the fact.

No one is going to burden AI companies at this point. The rights of copyright holders are a trivial matter compared to the potential of AI, the risk to certain labor markets, and such.

delusional 6 hours ago||

> Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.

> At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue share influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.

That is a gross mischaracterization. There was a time during that Viacom case when people were legitimately worried that YouTube would go away. The regime that YouTube has built now was established together with the large media companies, when those media companies could no longer ignore them.

felipeerias 11 hours ago|||

In practice, we seem to be leaning towards the idea that training on a copyrighted book is wrong if used to replicate or paraphrase that same book, but not if used to teach a model how to write better.

delusional 6 hours ago||

Property rights are a social construct. That doesn't mean you just get to claim "in general I am right" and do whatever you want.

DrewADesign 12 hours ago||

> They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.

It doesn’t sound unprofessional— it sounds unethical. Either they’re making something that they genuinely believe is unsafe but don’t want to stop because, you know, that’s business! Have you seen how much this shit costs? Or they’re deliberately making the entire country feel unsafe because it looks great to investors. Either way, frankly, fuck them and everybody else playing this dumb billionaire’s game. They deserve every bit of static this dimwitted government levels at them.

chpatrick 12 hours ago||

Unless you think someone's going to build it, and either it's you or them, and you hope you can do it less horrifically.

DrewADesign 6 hours ago||

And that somehow stops the other entity from building theirs anyway?

If Dario was altruistically trying to save us from the supposedly evil other party rather than pursuing oceans of cash, he’d have stayed in the nonprofit AI research space.

rubymamis 14 hours ago||

Fable was the only model that was able to detect a data corruption bug in my Qt C++ note-taking app[1] that all other tested models (gpt-5.5 xhigh, GLM-5.1, Kimi 2.7, DeepSeek V4 Pro) didn't find. I'll test on GLM-5.2 and Mimo v2.5 Pro soon.

[1] https://www.get-notes.com

king_phil 12 hours ago|

I asked Fable on max to create a mathematical model to show that c (speed of light) is emergent from pregeometric physics.

It said: I can't, but it would be lazy to say that it is not a possibility.

With some back and forth it created a 5 step plan to narrow down if our universe has all the right properties for this to be true.

We evaluated the first four stages to be true, and it wrote the solver to find out if the fifth test running the full model passes, but that will take thousands of hours of compute.

jaggederest 17 hours ago|

In my brief experience, the difference between Fable and Opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.

hodgehog11 17 hours ago||

Hard disagree. Opus reports to me like a student. Fable reported to me like a colleague (researcher). It genuinely seemed to pick up on nuance that the other models just don't, even when I tell them explicitly. It's been really frustrating that neither Codex nor Opus can make targeted edits to Fable's code without screwing something subtle up. For context, this is for computational geometry work, so your mileage may vary.

lukeschlather 17 hours ago|||

Fable happened to be released after I had been experimenting with Claude Code for roughly two weeks. I had been trying to use Sonnet, and when I switched to Opus it was night and day. My understanding of geometry was maybe not as good as it should've been, and I kept seeing Sonnet say things I knew were wrong but didn't know enough about 6DOF camera positioning to ask it to fix. I finally asked the right questions, it couldn't answer them at all, I switched to Opus, it was night and day. But! Opus still couldn't really keep 6DOF "in its head." When I left it to its own devices it tended to come back having forgotten that it needed to keep 6 degrees of freedom in its head and collapsed the problem down to 3DOF or just a single angle.

Fable just understood what I was talking about and never needed me to stop it and say "you forgot this thing we talked about." The difference in spatial reasoning capability between the three models is very very palpable. I am curious to get more time with it because ultimately I feel like I sandbagged it by giving it problems that would've been within Opus' abilities, but required a lot more handholding.

raphman 17 hours ago||||

> It's been really frustrating that neither Codex nor Opus can make targeted edits to Fable's code without screwing something subtle up.

Reminds me of the old adage: don't try to be too smart when writing code. Otherwise, dumber people - including your future self - will have trouble working with it.

murkt 16 hours ago|||

Some problems are very hard to solve with stupid code. This can easily be the case (computational geometry)

mejutoco 15 hours ago|||

For reference:

if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it

raphman 14 hours ago||

Ah thanks - I couldn't remember the original version.

For reference: it's called Kernighan's Law, and can be found in the Second Edition of "The Elements of Programming Style", page 10 [1].

The original phrasing is:

> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

[1] https://archive.org/details/the-elements-of-programming-styl...

mejutoco 12 hours ago||

It seems I was not able to either, and I trusted the Google AI snippet. Thanks

mohsen1 17 hours ago||||

Yes, in my project I made so much more progress in 3 days of Fable that it is not comparable to how Opus works.

sigbottle 17 hours ago||

To be fair, labs silently nerf models all the time.

Fable's probably objectively better at full power. I mean, I definitely felt the same difference in competency between Fable and current Opus. But Opus itself has definitely been nerfed, and Fable, even if it comes back to the public forever (probably won't), will get nerfed.

hypfer 17 hours ago||

I remember a time when a product didn't suddenly get worse while you were blinking.

That was a nice time. Let us get back to that time. Use open weights models. Own stuff.

TeMPOraL 14 hours ago||

That was before SaaS became a thing. Products didn't degrade over time because they couldn't easily reach out to your machine and remotely overwrite bytes on the CD-ROM the product came on.

hypfer 17 hours ago||||

Wait, so..

This is interesting. The "reported to me like a colleague" part.

Is it just that Anthropic gave Mythos even more of that Anthropic™ character, (incorrectly) radiating confidence?

Is that why people have been losing their minds over that thing? Is this just cheap social engineering?

I mean I bet it is also slightly more capable than opus, but that would all check out to me. Man.

Thanks for sharing I suppose.

8note 17 hours ago|||

the primary difference i noticed is that fable didn't try to check in every minute

to an extent that might have done it, but i had been playing around ahead of time trying to reverse engineer my ray bans case so i can make my own plastic insert, and fable took opus' work from mostly broken to mostly done, and then when fable went away, opus broke it again

TylerE 17 hours ago|||

No, it’s just a fundamentally much better model. Going back to Opus feels like the model has been lobotomized. It makes much more frequent errors, especially of the “I claimed I tested x y and z, but actually only kinda half heartedly tested x, and assumed I understood what was wrong” variety.

hypfer 17 hours ago||

Wait but that has been the exact word-for-word complaint when comparing sonnet to opus

Or opus to opus

Or really any new thing to old thing

cpburns2009 8 hours ago|||

You hear the same canard every time Anthropic releases a new model or version. I'm not convinced they're objective anecdotes. I wonder if it's simply the new model, while marginally better, has a different style and people find that new/refreshing. That is what makes it feel so much better than the previous release.

solumunus 17 hours ago|||

When the agent is becoming more accurate and thorough what would you expect to be reported?

hypfer 17 hours ago||

Oh I am sure that it became somewhat more accurate, and with that, the labeling there is in fact technically correct. It just does not work as an explainer for the doomsday-ish hype that model has induced in a lot of people's brains.

The user here is right in what they said but wrong in why they said it, essentially.

ben_w 16 hours ago|||

An analogy I keep coming back to with the current progress in LLMs is the progress in the 90s of 3D game engines.

Every upgrade made what came before it appear awful in comparison, to such an extent that every upgrade was called "photorealistic" and people kept forgetting that they'd been using that description for the previous engines that they were now dismissing.

https://archive.org/details/nextgen-issue-26

TylerE 17 hours ago|||

That’s a rather bad faith framing, I think. Who are you to judge why I said something?

hypfer 17 hours ago||

A person with the exact kind of pattern matching brain disorder this tech has been modeled after.

I do make mistakes though. Please check results.

dimgl 17 hours ago||||

Maybe I was getting downgraded to Opus 4.8 but I saw nothing even close to resembling this behavior when using Fable.

hodgehog11 15 hours ago||

It very much depends on the task. What were you trying it on?

saberience 10 hours ago|||

Funny, I find Codex to still be better at coding than Opus or Fable.

I A/B tested on a whole array of prompts between Codex and Fable, and Fable almost always found that Codex had produced a better plan and covered more edge cases than it did itself.

For every problem I gave the exact same prompt to both models, then I had each analyze the other's output. For roughly 80% of the prompts, Fable acknowledged that Codex's output was an improvement on its own, for 20% the converse situation occurred.

There was one egregious case where Fable suggested deploying code which would have resulted in a production bug, an edge case which Codex identified and proposed a fix for.

Note: this is all for optimized Rust code designed to be highly CPU and memory efficient.

I do prefer Anthropic's models for any tasks with front-end/design work needed. But I don't do much of that kind of work usually.

hodgehog11 6 hours ago||

I've used them back to back as well. Codex is good at specific tasks; it doesn't try to go big, it does what it's told provided the task is relatively procedural. If Codex can make progress on a task, why would I give it to Fable?

Fable fumbled the one simple task that I gave it too. I gave it multiple very hard open-ended tasks (effectively math tasks) involving research code and it crushed them. It's the first model I've seen that can do that. The current Codex will never produce the type of code Fable gave me no matter how many times I run the same problem at it, because it won't stop trying naive rubbish. And if I tell Codex to try to improve the code, it can't figure out why trying the same classical tricks isn't making it work better, regardless of what I tell it. Opus is marginally better because it can at least recognize some subtleties over time, but still disappointing because it has no idea how to deal with them.

Most programmers want precision instruments for their workflow. That's fine, use the right tool for the job. In my line of work, I need crazy solutions because the obvious stuff doesn't work. That's where Fable shined for me.

Tossrock 17 hours ago|||

I found Fable to be both more intelligent and much better at pursuing complex goals than any previous model. I was impressed enough that I wrote up my experience – it's a little unusual because it was on open source code, so I could post the full session transcript and commits, if people want to judge for themselves https://tossrock.substack.com/p/36-hours-with-fable

baq 16 hours ago|||

You might have found a use case on which both have the same capabilities, but this is in general very not true. I’ve had Fable autonomously fix concurrency bugs that other models couldn’t even diagnose from logs.

Perhaps it is a lot of small improvements all over the place, but the sum is a step change in capability.

somesortofthing 17 hours ago||

In LLMs, much like in humans, agency and misalignment are two sides of the same coin.

andsoitis 17 hours ago||

> agency and misalignment are two sides of the same coin.

The free will coin?

ben_w 16 hours ago||

In my experience "free will", like "consciousness" and "common sense", is not so much a concept with a universally agreed definition as it is a cognitive stop sign or an applause light, meaning different things to everyone who uses the term.

Do I have free will, or am I bounded by the laws of physics?

Even if you think my soul is completely independent of my body, there are theologians who argue that God being omniscient means that who goes to heaven and hell is predetermined before birth and therefore no action you take will ever change the afterlife you go to, and that to think God isn't omniscient would be blasphemy; do they think I have free will?

And then there's Thelema with "Do what thou wilt shall be the whole of the Law", which can be understood in terms of (amongst other things) "Don't let peer pressure manipulate you into thinking you want other things than you really want", though this is of course a simplification much as the omniscient example above: https://en.wikipedia.org/wiki/True_Will

TheOtherHobbes 12 hours ago||

Of all of the concepts like "consciousness" and "agency", "free will" is probably the least useful and most poorly defined.

It's a hand-me-down from Western beliefs about morality and individuality - including Thelema and Christianity.

So there's a lot of starting from the concept and working back to assumed conclusions.

Generally humans do not have free will, do have very limited political, economic, and psychological agency, usually selected from a small number of competing rule sets, and are also far more easily influenced than they suspect.

Culture is more like a cellular automaton or diffusion system. Occasionally a transformation ripples out from an individual cell, often for fairly random reasons, but the big patterns are emergent, and every so often the soup shakes itself up and settles into a new arrangement.

IMO LLMs are the most recent proto-version of that, running on a different substrate.
