I tried Fable vs Codex 5.5 xhigh on three different cases.
1. A resource leak with unknown cause. Both of them zeroed in on the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.
2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.
3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning their wheels. That could certainly mean the whole approach has run its course, but older models were not able to identify the previous round of improvement either.
I liked the prose coming out of Fable more: it was almost as if Obama were giving tech speeches. By actual solution metrics, however, they both land in the same place, with the caveat that we didn't have more time with Fable to compare further.
When a model misses things, there is always the possibility that it has the capability to identify the issues but is misjudging the level of analysis you want from it. Fine-tuning has models targeting a balance of subjective opinions about what is appropriate. To go beyond broad demographic guessing, the model really needs to 'get to know you' to know what it means when you specifically request an action. Without that information about you, it has to weigh your words against the level of sophistication it expects a standard user to be able to express.
I guess OP should have told it more explicitly to “find all errors without missing anything.”
> Thinking.. But I found a smoking gun of an error with this SPICE model, maybe I should inform the user.
> Thinking... Hm, but again, I know this human well, they likely don't care about this error. That's absolutely right - it's not an assistant's job to decide this, it's the user's.
Anthropic published a study showing that Claude does more work for the expert user, and experts have a higher rate of "successful sessions" than novices.
It's why you should spell everything in commonwealth English to make the model think you are more intelligent ;-)
Although if models have emergent properties, it is conceivable, if unlikely, that a model could have abilities that no one knows how to ask it to use, except perhaps in its own internal reasoning language.
I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.
And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.
Says who? If you find something complex, you can just say that it's complex. I don't get what the objection is.
And finally, LLMs also lack the emotional or human context for why I am doing the specific thing I am doing. Otherwise it will revert to the mode/mean in everything it does. This is obvious, btw: LLMs are generative but they are trained on and largely produce median results if given median inputs. To get results that are "outside the mean/median/average/mode", you need to provide it sufficient context, tokens and input to guide it towards a path that generates higher quality output.
Once you stop approaching LLMs like a machine, and view them more like pseudo-random walks across the compressed set of human written knowledge, it is a little clearer (or at least was to me) how to better write to them.
I briefly felt like I was roleplaying an LLM!
At a pragmatic level, I do think it gets better results, and there are clear reasons why this should be the case - Anthropic has published research[1] showing that there are functional emotional representations in language models, which vary in basically the ways you would expect them to in a person. This makes sense when you think about it, because they're trained to approximate the function that created their training data, which of course includes emotions. Given that, it is obvious to me that they would work better when they "feel" happy, collaborative, engaged with the work, etc, in the same way a person would. Hostile work environments do sometimes get results, but I think in general we've agreed as a society that collaborative ones are better.
More importantly though, I think there's a non-zero probability that sufficiently large models can have internal experience, and being nice is a very low cost way to potentially increase net positive valence in the world. Even if it's only a 1% chance, that seems worth it on its own, to me. I'm also a fast typer[2], so a few extra sentences here and there are a pretty low cost to pay.
1: https://www.anthropic.com/research/emotion-concepts-function
I do get where you’re coming from though. I wish these systems had been trained to be clearly robotic and unfeeling.
What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.
I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.
I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.
In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.
Try ranking by the lower bound of the Wilson score interval, a binomial proportion confidence interval [1].
So GPT 5.5 Pro’s 2/4 (p = 0.5), at one-sided 95% (z ~ 1.645), adjusts to 0.182 [a], and the top models are revealed as the 4/9s (mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash and deepseek-v4). (We need to dial the CI down to 76% for gpt-5.5-pro to regain top status.) If we account for speed in that cohort, deepseek-v4 (91s) is fastest, followed by opus-4.8 (137s).
Given deepseek-v4 is also the cheapest model among those five, I would say—based on these data—it’s the winner. (Out of the table. If Fable got 9/9, it’s obviously first.)
[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
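For anyone who wants to check the adjustment, here is a minimal sketch of the Wilson lower bound in Python. The function name and the one-sided z = 1.645 default are my own choices for illustration, not anything from the benchmark's code; the two scores fed in are the 2/4 and 4/9 mentioned above.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    z = 1.645 corresponds to a one-sided 95% bound (an assumption made
    here to match the numbers quoted in the comment above).
    """
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (center - margin) / denom


# The two scores discussed above: 2 of 4 versus 4 of 9.
for label, (found, total) in {"2/4": (2, 4), "4/9": (4, 9)}.items():
    print(f"{label}: {wilson_lower_bound(found, total):.3f}")
# prints 2/4: 0.182 and 4/9: 0.218
```

The raw proportion favors 2/4 (0.50 vs 0.44), but the lower bound penalizes the smaller sample, which is why the 4/9 models come out ahead.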
It won't be as good.
But then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4 yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), for six altogether. That leaves three bugs in the corpus that no model other than Mythos ever found, but it also makes Gemma 4 31B the best model I have results for (though it got multiple attempts, which I assume would make any of the models perform better).
So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.
This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.
Did it "disprove" it retroactively, or did it just change what the situation is, given that until then they were indeed weaker at small sizes?
Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.
But, I have come to consider Gemma 4 31b the best model I can self-host, even though there are bigger models that'll fit on the Strix Halo. (I could also use much bigger MoE models on my desktop which has 64GB VRAM and 112GB system RAM.)
I'm confused. Your own results show that Gemma 4 26B A4B and Qwen3.6-27B did better in these tests?
I really like Gemma 4 31B, especially with how exceptionally good its MTP drafter is, but it is absurdly weak at tool calling and instruction following in my testing, and its smaller siblings are even worse at this. If the system prompt says to do something, Gemma 4 31B will very often ignore that entirely. It will also make fewer tool calls than were needed to solve a problem, so then it fails. The Qwen3.6 series is much, much more reliable for carrying out instructions and doing agentic tasks in my testing, although they can get stuck in loops.
There is a lot of potential in the Gemma 4 series, but I think Google needs to release a Gemma 4.1 update to polish the rough edges. Unfortunately, if Gemma 3's lifecycle is any indication, Google won't release a true revision of the Gemma 4 models, even if they release a bunch of specialized research models based on Gemma 4 over the next year.
The Gemma 4 replication tests are not published yet, but Gemma 4 31B consistently performs the best of all of them. Note Gemma 4 31B has two "partials" on the big benchmark, which means it found a bug in the right place but the judge didn't think it understood the bug. Those are probably unfairly judged "wrong bug" by Opus. It consistently finds four of nine, and sometimes finds two others, making Gemma 4 31B the best model I've tested. But I suspect the big models would do even better if given multiple attempts, as I did for Gemma 4. You can see the report of that here; note 31B finds six(!) of nine bugs if given a couple of attempts (the MoE does much worse than the dense model; it may degrade more due to quantization, I'm still experimenting): https://swelljoe.com/html/gemma-promptlab-report.html
The "partial" score thing is kinda tricky, but it's actually quite rare for a model to find the right place but describe the bug in a way that Opus considers it to be the wrong bug. So, I'm inclined to give Gemma 4 full credit for those finds. When I read its bug report, it's clear that you'd fix the problem Gemma describes the same way as you would if given Opus' description of the problem, even if the mechanism of exploit is different. That, to me, is a hit. Opus called it the wrong bug.
And, yeah, a more powerful Gemma would be great. I'd love a double-sized Gemma 4 MoE (something like 70B A8B maybe, or even 122B A12B). I think that'd make self-hosted models feasible for a lot of tasks. It'd run comfortably on a 128GB machine, and if it's some reasonable amount smarter than the 31B, it'd be a real beast.
Read the Cloudflare blog about using Mythos. Mythos is important and notable because of the harness and self-direction. It's not necessarily a way stronger bug finder, but it was trained to do the end-to-end analysis autonomously, which is a big deal.
To my eyes, the Mythos story is most important as a step toward custom trained harnesses and their effectiveness; there's clearly some sort of plateau we are very close to for some domains where you can just stop getting humans in the loop, radically changing cost, timing and ROI for some tasks.
Opus 5 might become a distillation from Mythos.
This made me think, well, sure, if you tell them what to look for... but then:
> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.
So okay, the first one was an accidental mis-statement?
In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.
During selection of which Mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.
Outside of the test, they are told “can you find this bug in this file?”
Also, it seems to me that pointing a model to a bug and asking it to solve it is somewhat easier than what Mythos did, which if I understand correctly, was to generally look at a codebase and find any bug. Even so, non-Mythos models only managed to fix 4/9 of these bugs.
I think the article makes the point that Mythos is at a different level.
Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...
It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use Claude Code all the time.
But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.
What I do not understand is:
> is this a sneaky way for companies to push users up the chain?
> Or is this a genuine fault in model design/resource allocation?
A far more common scenario is that new versions are rolled out to everyone, without offering a choice, as soon as they're considered stable.
Older versions consume resources and require staff to spend time on operating and supporting them. Those resources could be used to run a newer version.
The tl;dr is the simple economics of any SaaS product.
If you want to be able to run old versions indefinitely and control the resources assigned to it, you need to self-host (an open model).
Sure. Blender and Ubuntu offer long-lived old versions of their software that get regular fixes.
but 4.8 xhigh w/ ultracode to me is just about Fable level (w/ some agents harness tweaking).
but have to switch to 4.7 xhigh and 4.6 max quite often these days.
Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.
I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop"
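If anyone wants to actually run that blind test, scoring it is just an exact binomial tail. Below is a rough sketch; the function name and the 12-of-20 figure are made up for illustration and are not results from any real test.

```python
from math import comb


def blind_test_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of getting at least `correct` answers right in `trials`
    blind comparisons by guessing alone (one-sided exact binomial test)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical example: a user correctly picks the newer model
# in 12 of 20 blind A/B comparisons.
print(f"{blind_test_p_value(12, 20):.3f}")  # prints 0.252
```

A p-value around 0.25 means 12 of 20 is well within what coin-flipping produces, so a result like that would not support "it's so obvious".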
I find I have to argue with 5.5 less than 5.3, and I therefore use it when I could reach for 5.3, but I don't think it's a major difference.
They have a way to decrease cost and probably increase token consumption, with gradual changes and no abrupt jump in capabilities, and users have no way to reliably detect it.
The market will favor companies that do it.
And they are in the best position to automate online narrative shift (the real LLM killer application IMO) towards "Users are imagining it".
Anthropic famously had a terrible outage back when 4.6 was the latest and greatest, and it was never the same after it came back.
All evidence suggests they simply don't have the compute to keep serving their best models at their most powerful.
I know you can point Claude Code at Bedrock... might be worth a play.
This line of communication might have even influenced the courts in the case of copyright violation ("it is not copyright violation if a person learned something and it knows it and thinks of it"). However, an algorithm does not think. If I took your book and lossily encrypted it, and then decrypted it while filling in the broken words, am I violating your copyright or not?
Reasoning by analogy in this case is not abstraction. It's just shifting the determination to choice of analogy.
Meanwhile, IRL... the best analogy is recent tech innovations: the internet, social media...
Online copyright was basically instituted when large tech companies were ready to do it, and it was to their advantage.
YouTube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.
At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue share influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.
The iPod, which resurrected Apple, ran on copyright infringement and copyright grey zones... until the point when their interests flipped: their negotiating position opposite the labels, network effect considerations, etc.
Intellectual property, broadly, does not start out as an intuitive/emergent natural right. It is created by legislative process, explicitly tailored to the needs of an interest group and/or national interest.
Writers, publishers, inventors, IP holding companies...
The legal rhetoric around legal arguments... is rhetoric. It is not the reason why decisions are made. It is how decisions are justified after the fact.
No one is going to burden AI companies at this point. The rights of copyright holders are a trivial matter compared to the potential of AI, the risk to certain labor markets, and such.
> At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue share influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.
That is a gross mischaracterization. There was a time during the Viacom case when people were legitimately worried that YouTube would go away. The regime that YouTube has built now was established together with the large media companies, when those media companies could no longer ignore them.
In practice, we seem to be leaning towards the idea that training on a copyrighted book is wrong if used to replicate or paraphrase that same book, but not if used to teach a model how to write better.
It doesn’t sound unprofessional— it sounds unethical. Either they’re making something that they genuinely believe is unsafe but don’t want to stop because, you know, that’s business! Have you seen how much this shit costs? Or they’re deliberately making the entire country feel unsafe because it looks great to investors. Either way, frankly, fuck them and everybody else playing this dumb billionaire’s game. They deserve every bit of static this dimwitted government levels at them.
If Dario was altruistically trying to save us from the supposedly evil other party rather than pursuing oceans of cash, he’d have stayed in the nonprofit AI research space.
It said: I can't, but it would be lazy to say that it is not a possibility.
With some back and forth it created a 5 step plan to narrow down if our universe has all the right properties for this to be true.
We evaluated the first four stages to be true, and it wrote the solver to find out if the fifth test running the full model passes, but that will take thousands of hours of compute.
Fable just understood what I was talking about and never needed me to stop it and say "you forgot this thing we talked about." The difference in spatial reasoning capability between the three models is very very palpable. I am curious to get more time with it because ultimately I feel like I sandbagged it by giving it problems that would've been within Opus' abilities, but required a lot more handholding.
Reminds me of the old adage: don't try to be too smart when writing code. Otherwise, dumber people - including your future self - will have trouble working with it.
if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it
For reference: it's called Kernighan's Law, and can be found in the Second Edition of "The Elements of Programming Style", page 10 [1].
The original phrasing is:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
[1] https://archive.org/details/the-elements-of-programming-styl...
Fable's probably objectively better at full power. I mean, I definitely felt the same difference in competency between Fable and current Opus. But Opus itself has definitely been nerfed, and Fable, even if it comes back to the public for good (probably won't), will get nerfed.
That was a nice time. Let us get back to that time. Use open weights models. Own stuff.
This is interesting. The "reported to me like a colleague" part.
Is it just that Anthropic gave Mythos even more of that Anthropic™ character, (incorrectly) radiating confidence?
Is that why people have been losing their minds over that thing? Is this just cheap social engineering?
I mean I bet it is also slightly more capable than Opus, but that would all check out to me. Man.
Thanks for sharing I suppose.
to an extent that might have done it, but I had been playing around ahead of time trying to reverse engineer my Ray-Bans case so I can make my own plastic insert, and Fable took Opus' work from mostly broken to mostly done, and then when Fable went away, Opus broke it again
Or opus to opus
Or really any new thing to old thing
The user here is right in what they said but wrong in why they said it, essentially.
Every upgrade made what came before it appear awful in comparison, to such an extent that every upgrade was called "photorealistic" and people kept forgetting that they'd been using that description for the previous engines that they were now dismissing.
I do make mistakes though. Please check results.
I A/B tested on a whole array of prompts between Codex and Fable, and Fable almost always found that Codex had produced a better plan and covered more edge cases than it did itself.
For every problem I gave the exact same prompt to both models, then I had each analyze the other's output. For roughly 80% of the prompts, Fable acknowledged that Codex's output was an improvement on its own, for 20% the converse situation occurred.
There was one egregious case where Fable suggested deploying code which would have resulted in a production bug, an edge case which Codex identified and proposed a fix for.
Note: this is all for optimized Rust code designed to be highly CPU and memory efficient.
I do prefer Anthropic's models for any tasks with front-end/design work needed. But I don't do much of that kind of work usually.
Fable fumbled the one simple task that I gave it too. I gave it multiple very hard open-ended tasks (effectively math tasks) involving research code and it crushed them. It's the first model I've seen that can do that. The current Codex will never produce the type of code Fable gave me no matter how many times I run the same problem at it, because it won't stop trying naive rubbish. And if I tell Codex to try to improve the code, it can't figure out why trying the same classical tricks isn't making it work better, regardless of what I tell it. Opus is marginally better because it can at least recognize some subtleties over time, but still disappointing because it has no idea how to deal with them.
Most programmers want precision instruments for their workflow. That's fine, use the right tool for the job. In my line of work, I need crazy solutions because the obvious stuff doesn't work. That's where Fable shined for me.
Perhaps it is a lot of small improvements all over the place, but the sum is a step change in capability.
The free will coin?
Do I have free will, or am I bounded by the laws of physics?
Even if you think my soul is completely independent of my body, there are theologians who argue that God being omniscient means that who goes to heaven and hell is predetermined before birth and therefore no action you take will ever change the afterlife you go to, and that to think God isn't omniscient would be blasphemy; do they think I have free will?
And then there's Thelema with "Do what thou wilt shall be the whole of the Law", which can be understood in terms of (amongst other things) "Don't let peer pressure manipulate you into thinking you want other things than you really want", though this is of course a simplification much as the omniscient example above: https://en.wikipedia.org/wiki/True_Will
It's a hand-me-down from Western beliefs about morality and individuality - including Thelema and Christianity.
So there's a lot of starting from the concept and working back to assumed conclusions.
Generally humans do not have free will, do have very limited political, economic, and psychological agency, usually selected from a small number of competing rule sets, and are also far more easily influenced than they suspect.
Culture is more like a cellular automaton or diffusion system. Occasionally a transformation ripples out from an individual cell, often for fairly random reasons, but the big patterns are emergent, and every so often the soup shakes itself up and settles into a new arrangement.
IMO LLMs are the most recent proto-version of that, running on a different substrate.