All those models that were just at version 1.x in 2024
That’s so wild
refulgentis 10 hours ago||
I liked both of Opus' better; it was very illuminating. In both cases I didn't see the errors Simon saw, and wondered why Simon skipped over the errors I saw.
Pelican: saturated!
jedisct1 12 hours ago||
I'm currently testing Qwen3.6-35B-A3B with https://swival.dev for security reviews.
It's pretty good at finding bugs, but not so good at writing patches to fix them.
nba456_ 10 hours ago||
Good reminder that these tests have always been useless, even before they started training on it.
tmatsuzaki 7 hours ago||
[dead]
whywhywhywhy 10 hours ago||
[flagged]
simonw 10 hours ago|
If they're testing against it, why do most of their attempts suck so much?
simon_is_genius 11 hours ago||
[flagged]
19qUq 12 hours ago||
How about switching to MechaStalin on a tricycle? It gets kind of boring.
mvanbaak 12 hours ago|
boring ... the ways all the models fail at a simple task never get boring to me
throwuxiytayq 11 hours ago||
I literally cannot believe that people are wasting their time doing this either as a benchmark or for fun. After every single language model release, no less.
sharkjacobs 11 hours ago||
It feels like the results stopped being interesting a little while ago but the practice has become part of simonw's brand, and it gives him something to post even when there is nothing interesting to say about another incremental improvement to a model, and so I don't imagine he'll stop.
stephbook 11 hours ago||
I, for one, expected progress. Uneven, sometimes delayed, but ever increasing progress.
But that Opus pelican?
cedws 10 hours ago|||
It’s not a waste of time.
As the boundaries of AI are pushed we increasingly struggle to define what intelligence actually is. It becomes more useful to test what models cannot do instead of what they can. Random tasks like the pelican test can show how general the intelligence really is, putting aside the obvious flaw that the labs can optimise for such a simple public benchmark.
throwuxiytayq 1 hour ago||
The whole point of this benchmark is that it asks the model to work in a modality it is not trained in and does not understand well. The result is largely meaningless. This is just like the people who are endlessly surprised by the fact that a raw LLM does not work with numbers well, or miscounts letters. In short, this test benchmarks the intelligence of the person running it, not of the model.
recursive 10 hours ago|||
Fun is so un-productive. Everyone doing things for "fun" is going to be sorry when they look back and realize they were wasting time having a "good time" rather than optimizing their KPIs.
throwuxiytayq 1 hour ago||
Sarcasm aside, asking LLMs to draw pelicans is your idea of fun? I'm worried for you.
bschwindHN 6 hours ago|||
I do wonder how much energy collectively has been burned on this useless "benchmark".
segmondy 11 hours ago|||
I can't believe you're such a party pooper. It's exciting times, the silly things do matter!
Marciplan 8 hours ago||
I also can't understand how this goes so viral every time on Hackernews lol