Posted by tosh 7 hours ago
Previous models including Claude Opus 4.6 have generally produced a lot of noise/things that the compiler already reliably optimizes out.
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).
IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.
Usually, when you decrease false positive rates, you increase false negative rates.
Maybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible.
It's agents all the way down.
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
https://hn.algolia.com/?q=1stproof
This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.
I'm really glad they did it.
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.
Again, like I said before, it's also a terrible benchmark.
I was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.
If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
I also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.
ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.
You would need to check to see if everyone is having mistakes on the same 20% or different 20%. If its the same 20% either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. Old MMLU non pro had a lot of wrong answers. Simple things like MNIST have digits labeled incorrect or drawn so badly its not even a digit anymore.
Arc-AGI score isn't correlated with anything useful.
It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.
>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
And get back an automatic coupon code app like the user actually wanted.
Not interested enough to pay $250 to try it out though.