Posted by lairv 11 hours ago

ARC-AGI-3(arcprize.org)
https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
297 points | 189 comments
6thbit 9 hours ago|
Not clear to me the diff with v2?
ACCount37 9 hours ago||
They stacked the deck. If v2 was still rule inference + spatial reasoning, a bit like juiced-up Raven's Progressive Matrices, then v3 adds a whole new multi-turn explore/exploit agentic dimension to it.

Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that won't last.

jasonjmcghee 9 hours ago||
v2 was a static fill-in-the-blank task, whereas v3 is interactive.

There's world state that you can change, not just pixels to place.

Here's v2:

https://arcprize.org/tasks/ce602527
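
For a sense of what "interactive" means here: a v3 episode is basically an agent loop where you act, observe how the world state changed, and infer the rules as you go. A minimal sketch in Python (the environment, action set, and rules below are made-up stand-ins, not the actual ARC-AGI-3 API):

    import random

    ACTIONS = ["up", "down", "left", "right", "click"]  # assumed action set

    class ToyEnv:
        """Made-up stand-in: a 3x3 grid where the goal is to reach (2, 2)."""
        def reset(self):
            self.pos = [0, 0]
            return tuple(self.pos)

        def step(self, action):
            # World state changes in response to actions; the agent gets
            # no instructions and must infer these rules from experience.
            if action == "down":
                self.pos[0] = min(2, self.pos[0] + 1)
            elif action == "right":
                self.pos[1] = min(2, self.pos[1] + 1)
            return tuple(self.pos), self.pos == [2, 2]

    env = ToyEnv()
    state, done = env.reset(), False
    while not done:
        action = random.choice(ACTIONS)  # explore: try actions, watch effects
        state, done = env.step(action)   # a real agent would update its world
                                         # model here and start exploiting it
    print("solved")

That multi-turn explore/exploit loop is the part v2 never had.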

jmkni 8 hours ago||
ok clearly I'm a robot because I can't figure out wtf to do
dinkblam 10 hours ago||
what is the evidence that being able to play games equates to AGI?
modeless 9 hours ago||
The test doesn't prove you have AGI. It proves you don't have AGI. If your AI can't solve these problems that humans can solve, it can't be AGI.

Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.

observationist 9 hours ago|||
If AI X can solve the tests and AI Y cannot, all else being equal, then X is closer to AGI than Y. There's no meaningful scale implicit in the tests, either.

Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then it's not 100% certain, because a model might have some cheap tricks implicit in the architecture that don't actually map to a meaningful difference in capabilities.

Gotta love the field of AI.

rolux 8 hours ago||||
Will there be a point in that series of ARC-AGI tests where AI can design the next test, or is designing the next test always going to be a problem that can be solved by humans and not AI?
modeless 7 hours ago||
I don't see why AI couldn't design tests. But they can only be validated by humans, as they are intended to be possible and ideally easy for humans to solve.
famouswaffles 9 hours ago|||
> It proves you don't have AGI.

It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard, but this one really takes the cake. A 'human baseline' that isn't really a baseline, and scoring so convoluted that a model could beat every game in reasonable time and still score well below 100. Really, what are we doing here?

That Francois had to resort to all this nonsense should tell you something about where we are right now.

ACCount37 9 hours ago|||
None whatsoever.

It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.

The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.

arscan 9 hours ago|||
I think the idea is that if there's any cognitive task that is trivial for humans but that they cannot perform, then we can state they haven't reached 'AGI'.

It used to be easy to build these tests. I suspect it’s getting harder and harder.

But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…

fsdf2 8 hours ago||
The reality is machines can brute force endlessly to an extent humans cannot, and make it seem like they are intelligent.

That's not intelligence, though, even if it may appear to be. Does it matter? That's another question. But it certainly is not a representation of intelligence.

furyofantares 9 hours ago|||
There isn't a strict definition of AGI, there's no way to find evidence for what equates to it, and besides, things like this are meant only as likely necessary conditions.

Anyway, from the article:

> As long as there is a gap between AI and human learning, we do not have AGI.

This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.

fragmede 9 hours ago||
Is it that within a codebase of relatively fixed size, things get worse as time goes on? Or are you saying that as the codebase grows, the limits of the model's context mean it can no longer hold the entire codebase in context, so it performs worse than when the codebase was smaller?
furyofantares 9 hours ago||
I think there are a few factors. Codebase size is one, and the tendency for vibe coding to be mostly additive certainly doesn't help with that.

But vibe coding also tends to produce somewhat poor architecture: lots of redundant and intermingled bits that should be refactored. I think the model performs worse the worse the code it has to work with is, which I presume is only partly because it's fundamentally harder to work with bad code, but also partly because its context is filled with bad code.

observationist 9 hours ago|||
The evolution of the test has been driven partly by the evolution of AI capabilities. To take the most skeptical view, the types of puzzles AI has trouble solving are in a domain of capabilities where AGI might be required to solve them.

By updating the tests specifically in areas AI has trouble with, they create a progressive feedback loop that AI development can push against. There's no known threshold or well-defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold, but there's no good indication as to whether solving a particular test means it's 15% closer to AGI or 0.000015%.

It's more or less a best effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI which is way too broad to be useful in the context of developing AGI.)

sva_ 9 hours ago|||
That is not the claim. It is a necessary condition, but not a sufficient one.
futureshock 9 hours ago|||
The evidence is that humans are able to win these games. AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could. The point of these ARC benchmarks is to find tasks that humans can do easily and AI cannot, thus driving a new reasoning competency as companies race each other to beat human performance on the benchmark.
didibus 9 hours ago||
> AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could

I think one major disconnect is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, this human would be the smartest, most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.

This is why the "goal post" appears to always be moving: the non-commoners who are involved with making AGI never want to accept that definition, which, to be fair, seems too subjective. Instead they like to approach AGI as something different: it can solve some problems humans can't; when it doesn't fail, it behaves like an expert human; etc.

Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI if it lacks the inherent opinions, personality, character, inquiries, and failure patterns of a human.

And I think that goes so far as: a text-only model can never meet this bar. If it cannot react in equal time to subtle facial cues and sounds, if answering you and the flow of conversation is slower than it would be with a human, etc., it falls short. All of these are also required before the average person would accept that AGI has been achieved.

fragmede 9 hours ago||
By that definition, does a human at the other end of a high-latency video call not have AGI, because they can't react any faster than the connection's latency allows? From your POV, what's the difference between that and an AI that's just slow?
CamperBob2 11 hours ago||
Without reading the .pdf, I tried the first game it gave me, at https://arcprize.org/tasks/ls20, and I couldn't begin to guess what I was supposed to do. Not sure what this benchmark is supposed to prove.

Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.

WarmWash 10 hours ago||
The goal is to learn the rules, and then use that to win.

If you mess around a little bit, you will figure it out. There are only a few rules.

szatkus 11 hours ago||
> Only environments that could be fully solved by at least two human participants (independently) were considered for inclusion in the public, semi-private and fully-private sets.

Apparently those games are supposed to be hard.

nubg 10 hours ago||
Any benchmarks?
gordonhart 9 hours ago|
The main frontier models are all up on https://arcprize.org/tasks

Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) scoring 0% across the board.

ACCount37 9 hours ago|||
Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.

Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.

This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.

gordonhart 9 hours ago|||
My broad vibe is that Gemini 3.1 Pro is the best at visual/spatial tasks and oneshotting, while Opus 4.6 is the best at path planning. This task leans heavily on both, but maybe a little more towards planning, so I'm not too shocked that Opus is narrowly on top.

When running, the grids are represented as JSON, so the visual component is nullified, but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding, I do wonder if it would perform better with a harness that renders the grid visually.
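
As a sketch, a harness tweak along those lines could just rasterize the JSON before each turn (the palette and grid values here are assumptions on my part, not ARC's actual spec):

    import json
    from PIL import Image

    # value -> RGB; assumed palette, not ARC's actual one
    PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54)}

    def grid_to_image(grid_json: str, scale: int = 16) -> Image.Image:
        grid = json.loads(grid_json)          # e.g. [[0, 1], [2, 0]]
        h, w = len(grid), len(grid[0])
        img = Image.new("RGB", (w, h))
        for y, row in enumerate(grid):
            for x, value in enumerate(row):
                img.putpixel((x, y), PALETTE.get(value, (128, 128, 128)))
        # nearest-neighbor upscale keeps the cells crisp for the model
        return img.resize((w * scale, h * scale), Image.NEAREST)

    grid_to_image("[[0, 1], [2, 0]]").save("frame.png")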

culi 7 hours ago|||
Given the drastic difference in price, I think the chart definitely shows Gemini 3.1 in the best light. Google DeepMind gets basically the same results, but without being willing to pay as much in electricity as Anthropic is to achieve its benchmarks.
thatguymike 8 hours ago|||
Curious, that doesn't match the graph up on the Leaderboard page? https://arcprize.org/leaderboard
gordonhart 7 hours ago||
The individual task scores are all on public tasks, they still held out a hundred or so private tasks that presumably GPT-5.4 did well on to get its leaderboard position.
saberience 8 hours ago||
So this is another ARC-"AGI" benchmark that is again designed around visual perception for LLMs that are trained to be great at text. What is the point?

Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.

The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.

Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where we have LLMs try and beat Myst or Riven? Will we have AGI then?

This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.

diablevv 3 hours ago||
[dead]
tasuki 10 hours ago|
So ARC-AGI was released in 2019. That's been solved; then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057?
muskstinks 10 hours ago||
This is clear AGI progress. It should show you that AI is not sleeping; it keeps getting better, and you should take this as a signal to take the topic seriously.
applfanboysbgon 10 hours ago||
Labelling a test "AGI" does not show AGI progress any more than labelling a cpu "AGI" makes it so. It might show that AI tools are improving but it does not necessarily follow that tools improving = AGI progress if you're on the completely wrong trail.
muskstinks 9 hours ago|||
The transfer of knowledge required here is that an ARC-AGI-3 is now necessary and adds another dimension of capability.

These 'tests' are not labeled AGI by magic but because they are designed specifically for testing certain things a question-and-answer test can't.

Gemini and OpenAI are at 80-90% on ARC-AGI-2, and it's quite interesting to see the difference in challenge between 2 and 3.

The "G" in AGI means general, btw. So every additional dimension an agent can solve pushes that agent to be more general.

zarzavat 9 hours ago||||
Any test that humans can pass and AIs cannot is a stepping stone on the way to AGI.

When you run out of such tests, that's evidence you have reached AGI. The point of these tests is to define AGI objectively, as the point at which we can no longer devise tests on which humans have superiority.

gordonhart 10 hours ago|||
The point is still to test frontier models at the limit of their capabilities, regardless of how it's branded. If we're still capable of doing so in 2057 I'll upvote the ARC-AGI-26 launch post!
futureshock 9 hours ago|||
Well yes, that is exactly the point! The very purpose of the ARC-AGI benchmarks is to find a pure reasoning task that humans are very good at and AI is very bad at. Companies then race each other to get a high score on that benchmark. Sure, there's going to be a lot of "studying for the test" and benchmaxing, but once a benchmark gets close to being saturated, ARC releases a new benchmark with a new task the AI is terrible at. This will rinse and repeat until ARC can find no reasoning task that a human can do and AI cannot. At that point we will effectively have AGI.

I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.

didibus 9 hours ago|||
It helps the model makers have a harness to optimize for in their next model version.

They'll specifically work to pass the next version of ARC-AGI by figuring out what kind of dataset is missing, such that training on it would have their model pass the new version.

They ideally don't train directly on ARC-AGI itself, but they can train on similar problems/datasets in the hope of learning skills that then transfer to solving the real ARC-AGI.

The point is that a new version of ARC-AGI should help the next model be smarter.

tibbar 10 hours ago|||
The point is that ideally the models keep improving until they can solve problems people care about. Which is already partly true, but there are lots of problems that are still out of reach.
minimaxir 10 hours ago|||
It's semver.
refulgentis 9 hours ago||
You’re absolutely right to point it out.

LLMs weren’t supposed to solve 1, they did, so we got 2 and it really wasn’t supposed to be solvable by LLMs. It was, and as soon as it started creeping up we start hearing about 3: It’s Really AGI This Time.

I don’t know what Francois’ underlying story is, other than he hasn’t told it yet.

One of a few moments that confirmed it for me was a month ago, when he was Just Asking Questions re: whether Anthropic still used SaaS. That was an odd conflation of a hyperbolic reading of a hyperbolic stonk-market-bro narrative (SaaS is dead), low-info takes on LLMs (Claude's not the only one that can code), and addressing the wrong audience (if you follow Francois, you're likely neither of those poles).

At this point I’d be more interested in a write up from Francois about where he is intellectually than an LLM that got 100% on this. It’s like when Yann would repeat endlessly that LLMs are definitionally dumber than housecats. Maybe, in some specific way that makes sense to you. You’re brilliant. But there’s a translation gap between Mount Olympus and us plebes, and you’re brilliant enough to know that too. So it comes across as trolling and boring.