Posted by anabranch 6 days ago
To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.
And junior devs have never added much value. The first two years of any engineer’s career are essentially an apprenticeship. There’s no value add from having a perpetually junior “employee”.
Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs them. When we showed these reminders to 4.7, it tended to over-fixate on them. The fix was to stop adding cyber reminders.
More here: https://x.com/ClaudeDevs/status/2045238786339299431
> 4.7 is quite... dumb. i think they have lobotomized this model
Is adaptive thinking still broken? Why was the option to disable it taken away?
Also, there should be a time distribution for the queries and a way to filter by query time. This is because Anthropic is reported to change the model quality arbitrarily in the background.
Also, the table column headers have no units. For example, "Request 4.7": is this the number of tokens 4.7 consumes? Is it output, input, reasoning, etc.?
Really difficult to make sense of this.
People get offended if what they are doing is labeled as slop but this is unfortunately the level of quality I expect from AI related content or code.
This has resulted in a +92.9% difference in cost and tokens. Submission bd2457e5, currently at the top of the leaderboard.
It looks like you don't allow testing of anything beyond a certain token size.
Which makes your test kind of pointless, because if you are chatting about AI with something that's only a few hundred tokens, the data you're collecting is pretty minimal and specific, not something that's generally applicable or relevant to wider use outside of that specific context.