
Posted by lairv 5 hours ago

ARC-AGI-3(arcprize.org)
https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
218 points | 154 comments
Tiberium 3 hours ago|
https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):

- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving and you don't compare the score against a human average but against the second best human solution

- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% ((10/100)^2)

- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (I know that's not quite right). If the median human takes 1.5x more steps than your 2nd fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom 10% guy, who maybe solves 30% of levels but takes 3x more steps to solve them. This guy would get a score of 3%

- The scoring is designed so that even if AI performs on a human level it will score below 100%

- No harness at all and very simplistic prompt

- Models can't use more than 5X the steps that a human used

- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"
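The squared-efficiency arithmetic in these bullets can be sketched as follows. This is a hypothetical illustration using the numbers from the tweets, not the official scoring implementation; the function name and signature are invented.

```python
# Hypothetical sketch of the squared-efficiency scoring described above:
# a level scores (human_baseline / model_actions)^2, capped at 1.0,
# where the baseline is the 2nd-best human action count.

def level_score(human_actions: int, model_actions: int, solved: bool) -> float:
    """Return 0 for an unsolved level, else squared efficiency vs. the baseline."""
    if not solved:
        return 0.0
    return min(1.0, (human_actions / model_actions) ** 2)

# Example from the thread: human baseline 10 actions, model 100 actions -> 1%
print(round(level_score(10, 100, True), 4))  # 0.01

# Median-human estimate from the thread: solves 60% of levels at 1.5x actions
print(round(0.6 * (1 / 1.5) ** 2, 3))        # 0.267
```

Note how the squaring makes the penalty for extra actions much steeper than a linear efficiency ratio would.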

fc417fc802 2 hours ago||
Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.
girvo 1 hour ago||
Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?
benjaminl 9 minutes ago|||
The issue here is that people have different definitions of AGI. From the description, getting 100% on this benchmark would be more than AGI and would qualify as ASI (Artificial Super Intelligence), not just AGI.
andy12_ 1 hour ago|||
I think that any logic-based test that your average human can "fail" (aka, score below 50%) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, is considered AGI under that definition).
NitpickLawyer 3 hours ago|||
> No harness at all and very simplistic prompt

TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.

dyauspitr 56 minutes ago|||
If anything this makes the test much harder for the LLM to get high scores and that makes the scores they’re getting all that much more impressive.
Marazan 56 minutes ago|||
"Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.

If you are trying to measure GENERAL intelligence then it needs to be general.

fchollet 3 hours ago|||
Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation, it's pretty standard, not a brand new thing.
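For context, the SPL (Success weighted by Path Length) metric from robotics navigation that Chollet cites is usually computed roughly as below. This is a sketch of the standard SPL formula, not necessarily ARC-AGI-3's exact variant (which, per the bullets above, squares the ratio and uses a human baseline rather than a shortest path).

```python
# Sketch of the standard SPL metric: success weighted by how close the
# taken path length is to the shortest (optimal) path, averaged over episodes.

def spl(episodes):
    """episodes: list of (success: bool, shortest_path: float, taken_path: float)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# One optimal episode and one that took twice the shortest path:
print(spl([(True, 10, 10), (True, 10, 20)]))  # 0.75
```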

We tested ~500 humans over 90-minute sessions in SF, with a $115-$140 show-up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hired regular folks as our testers.

Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.

Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.

> No harness at all and very simplistic prompt

This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."

...

"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.

...

"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."

If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out these tools.

Imnimo 2 hours ago|||
Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.
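One way to make this thought experiment concrete is a quick Monte Carlo under the squared-efficiency scoring described upthread. The action counts below are invented for illustration, so the resulting number only shows the shape of the question, not an actual answer for the real benchmark.

```python
# Hypothetical simulation of the "Mechanical Turk AI": randomly sample one of
# the human players' runs for a level and score it as a model would be scored,
# i.e. min(1, (baseline / actions)^2) with the 2nd-best human as baseline.
import random

def turk_score(action_counts, trials=10_000, seed=0):
    rng = random.Random(seed)
    baseline = sorted(action_counts)[1]  # second-best human action count
    total = 0.0
    for _ in range(trials):
        chosen = rng.choice(action_counts)
        total += min(1.0, (baseline / chosen) ** 2)
    return total / trials

# Ten made-up human action counts for one level:
counts = [40, 45, 50, 60, 70, 80, 90, 100, 120, 150]
print(round(turk_score(counts), 2))
```

With this particular made-up spread, the "random human" lands well below 100%, which is exactly the contextualization gap the comment points at.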
causal 2 hours ago||||
Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.

I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.

fchollet 2 hours ago|||
I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

causal 2 hours ago||
Well, yes, and would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.
adgjlsfhk1 1 hour ago||
The whole point of AGI is "general" intelligence, and for that intelligence to be broadly useful it needs to exist within the context of a human centric world
causal 1 hour ago|||
Then why deny it a harness it can also use in a human centric world?
scotty79 20 minutes ago|||
General intelligence doesn't require owning retinas.

Denying a proper eyesight harness is like trying to construct a speech-to-text model that makes transcripts from air pressure values measured 16k times per second, while the human ear does frequency-power measurement and frequency binning due to its physical construction.

fc417fc802 2 hours ago||||
The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.

I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.

causal 1 hour ago||
The issue is that ARC AGI 3 specifically forbids harnesses that humans get to use.
Groxx 23 minutes ago||
>We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools.

Sounds like LLMs are also given access to stuff that humans aren't (external tools).

So like. Somewhat agreed. But somewhat not, and also that seems like something that models/harnesses can improve on (many models support images, though I would agree that it's usually pretty far from human-equivalent in capability).

blueblisters 2 hours ago||||
I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.

Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).

strongpigeon 2 hours ago||||
Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher?
cdetrio 43 minutes ago||
The Duke harness was specifically designed for these puzzles, that's why they don't want to measure it.

My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.

But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not ARC-AGI specific. In that case, the models should be benchmarked by prompting through Claude Code and Codex, rather than through the API (as from the API we only expect raw LLM output, and no tool use).

FINDarkside 1 minute ago||
OpenAI does have Python execution behind its general-purpose API, but it has to be enabled with a flag, so I don't think it was used.
WarmWash 3 hours ago||||
Maybe this is a neither-confirm-nor-deny thing, but are there systems in place or design decisions made that are meant to surface attempts at benchmark optimizing (benchmaxxing), outside of just having private sets? Something like a heuristic anti-cheat, I suppose.

Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.

fchollet 2 hours ago||
There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.
cdetrio 1 hour ago||||
Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. claude code, or codex, or even the web front-ends)?
GodelNumbering 2 hours ago|||
Off topic, but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a great read.
theLiminator 3 hours ago||
Lol basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time.
pptr 29 minutes ago|||
Let's say an agent needs to do 10 brain surgeries on a human to remove a tumor and a human doctor can do it in a single surgery. I would prefer the human.

"steps" are important to optimize if they have negative externalities.

ACCount37 3 hours ago||||
It's kind of the point? To test AI where it's weak instead of where it's strong.

"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.

famouswaffles 2 hours ago|||
ARC has always had that problem but for this round, the score is just too convoluted to be meaningful. I want to know how well the models can solve the problem. I may want to know how 'efficient' they are, but really I don't care if they're solving it in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy convoluted score.

'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it unlike the above 2 but it's just incredibly silly to me to think we should be directly comparing something like that with entities operating in wildly different substrates.

If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from "solving only a tiny fraction of problems" to "solving everything correctly but with more 'reasoning steps' than the best human scores." Wildly different implications. What use is a score like that?

pants2 1 hour ago|||
The measurement metric is in-game steps. Unlimited reasoning between steps is fine.

This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.

famouswaffles 1 hour ago||
Same thing in this case. No utility, and just as arbitrary. None of the issues with the score change.

Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.

Regardless, since there's a 5x step cutoff, "brute forcing with millions of steps" was never on the table.

thereitgoes456 36 minutes ago|||
The metric is very similar to cost. It seems odd to justify one and not the other.
famouswaffles 11 minutes ago||
Cost has utility in the real world and this doesn't. That's the only reason I would tolerate thinking about cost, and even then, I would never bundle it into the same score as the intelligence, because that's just silly.
jstummbillig 2 hours ago|||
It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious).
cyanydeez 3 hours ago|||
I think your logic isn't sound: wouldn't we want an "intelligence" to solve problems efficiently rather than brute-force them with a million monkeys? There's definitely a limit to compute, the same way there's a limit to how much oil we can use, etc.

In theory, sure, if I can throw a million monkeys at a problem and ramble my way to a solution, it doesn't matter how I got there. In practice though, every attempt has direct and indirect externalities. You can argue those externalities are minor, but the largesse of money going to data centers suggests otherwise.

Lastly, humans use way less energy to solve these in fewer steps, so of course it matters when you throw kilowatts at something that takes milliwatts to solve.

diego_sandoval 2 hours ago||
> Lastly, humans use way less energy to solve these in fewer steps,

Not if you count all the energy that was necessary to feed, shelter, and keep the human at his preferred temperature so that he can sit in front of a computer and solve the problem.

cyanydeez 1 hour ago||
OK, but that's the same for building a data center.

Try again.

gunalx 1 hour ago|||
Yes, especially when considering that a datacenter needed the energy of pretty many people to be built.

A single human is indeed more efficient, way more flexible, and actually just generally intelligent.

fsdf2 29 minutes ago|||
Oh and who provided the 'food' for the models?

...

People who write stuff like the poster above you... are bizarro. Absolutely bizarro. Did the LLM manifest itself into existence? Wtf.

Edit: just got confirmation of the bizarro-ness after looking at his YouTube.

BeetleB 3 hours ago||
> As long as there is a gap between AI and human learning, we do not have AGI.

Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.

One AI researcher's quote stood out to me:

"It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

jonahx 52 minutes ago||
> As long as there is a gap between AI and human learning, we do not have AGI.

Don't read the statement as a human dunk on LLMs, or even as philosophy.

The gap is important because of its special and devastating economic consequences. When the gap becomes truly zero, all human knowledge work is replaceable. From there, with robots, it's a short step to all work being replaceable.

What's worse, the condition is sufficient but not even necessary. Just as planes can fly without flapping, the economy can be destroyed without full AGI.

daemonologist 3 hours ago|||
Or the classic from Dijkstra (https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD867...):

> even Alan M. Turing allowed himself to be drawn into the discussion of the question whether computers can think. The question is just as relevant and just as meaningful as the question whether submarines can swim.

(I am of the opinion that the thinking question is in fact a bit more relevant than the swimming one, but I understand where these are coming from.)

imiric 1 hour ago||
I've come across that quote several times, and reach the same conclusion as you.

While I share Dijkstra's sentiment that "thinking machines" is largely a marketing term we've been chasing for decades, and this new cycle is no different, it's still worth discussing and... thinking about. The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming. It's frankly disappointing that such a prominent computer scientist and philosopher would be so dismissive and uninterested in this fundamental CS topic.

Also, it's worth contextualizing that quote. It's from a panel discussion in 1983, which was between the two major AI "winters", and during the Expert Systems hype cycle. Dijkstra was clearly frustrated by the false advertising, to which I can certainly relate today, and yet he couldn't have predicted that a few decades later we would have computers that mimic human thinking much more closely and are thus far more capable than Expert Systems ever were. There are still numerous problems to resolve, w.r.t. reliability, brittleness, explainability, etc., but the capability itself has vastly improved. So while we can still criticize modern "AI" companies for false advertising and anthropomorphizing their products just like in the 1980s hype cycle, the technology has clearly improved, which arguably wouldn't have happened if we didn't consider the question of whether machines can "think".

jwpapi 34 minutes ago|||
You know what the G stands for in AGI? General intelligence. You could measure a plane on general versatility in the air and it would lose against a bird. You could also measure it on energy consumption. There are a lot of things you can measure; a lot of them are pointless, and a lot of articles on HN are pointless.

There are very valid reasons to measure that. You wouldn't ask a plane to drive you to the neighbor's or to buy you groceries at the supermarket. It's not as generally mobile as you are, but it increases your mobility.

NitpickLawyer 3 hours ago|||
For me the whole "are we there yet" debate wrt AGI is already dead, since the tools we've had for ~1.5 years are already incredibly useful for me. So I just don't care anymore. For some people we're already there. For others we'll never get there. Definitions change, goalposts move, etc. In the meantime we're already seeing ASI stuff coming (self-improvement and so on).

But the arc-agi competitions are cool. Just to see where we stand, and have some months where the benchmarks aren't fully saturated. And, as someone else noted elsewhere in the thread, some of these games are not exactly trivial, at least until you "get" the meta they're looking for.

AuryGlenz 3 hours ago||
In the Expeditionary Force series of sci-fi novels pretty much every civilization treats their (very advanced, obviously AGI) AIs not as living beings. Humans are outliers in the story. I think there will always be a dichotomy. Obviously we aren't at the point where we should treat the models as beings, but even if we do get to that point there will be plenty of people that essentially will say they don't have souls, some indeterminate quality, etc.
WarmWash 2 hours ago|||
It's unlikely that intelligence comes in only human flavor.

It also doesn't actually matter much, as ultimately the utility of its outputs is what determines its worth.

There is the moral question of consciousness though, a test for which it seems humans will not be able to solve in the near future, which morally leads to a default position that we should assume the AI is conscious until we can prove it's not. But man, people really, really hate that conclusion.

EternalFury 25 minutes ago|||
So…calculators are intelligent? How about accountants that failed arithmetic 101 in high-school, are they intelligent? Generally intelligent?
unsupp0rted 3 hours ago|||
I think there's some third baseline standard, which most humans and some AI can meet to be considered "intelligent". A lot of humans are essentially p-zombies, so they wouldn't meet the standard either. Possibly all humans. Possibly me too.
Raphael_Amiard 3 hours ago||
The very obvious flaw with that argument is that flying is defined by, you know, moving in the air, whereas intelligence tends to be defined with the baseline of human intelligence. You can invent a new meaning, but it seems kind of dishonest
jwpapi 39 minutes ago||
This is a very good estimation of AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games.

I really wonder why so many people fight against this. We know that AI is useful, we know that AI is useful for research, but we want to know if it is what we vaguely define as intelligence.

I've read the "airplanes don't flap their wings" and "submarines don't swim" comparisons. Yes, but this is not the question. I suggest everyone coming up with these comparisons check their biases, because this is about Artificial General Intelligence.

General is the keyword here; this is what ARC is trying to measure. Whether it's useful or not isn't the point. Whether AI, after testing, turns out to be useful or not isn't the point either.

This so far has been the best test.

And I also recommend people ask AI about specialized questions deep in your job that you know the answer to, and see how often the solution is wrong. I would guess it's more likely that we perceive knowledge as intelligence than that we're missing intelligence. Probably common amongst humans as well.

adamgordonbell 33 minutes ago||
AGI's "general" is the wrong word, I think. Humans aren't general, we're jagged. Strong in some areas, weak in others, and already surpassed in many domains.

LLMs are way past us at language, for instance. Calculators passed us at calculating, etc.

jwpapi 25 minutes ago|||
Interesting take.

Just to drive that thought further.

What are you suggesting? Should we rename it? To me the fundamental question is this.

Do we still have tasks that humans can do better than AIs?

I like the question. I think another good test is "make money". There are humans that can generate money from their laptop. I don’t think AI will be net positive.

I’ve tried to create a Polymarket trading bot with Opus 4.6. The ideas were full of logical fallacies and many many mistakes.

But also I'm not sure how they would compare against an average human with no statistics background.

I think it's really about establishing whether by AGI we mean better than the average human or better than the best human.

EternalFury 28 minutes ago|||
We are jagged, but we can smooth that jaggedness if we choose to do so. LLMs stay jagged.
scotty79 7 minutes ago||
Previous iterations of ARC-AGI were reminiscent of IQ tests. This one is just too easy, and the fact that models do terribly badly on it probably means there is an input-mode mismatch or an operation-mode mismatch.

If model creators are willing to teach their LLMs to play computer games through text, it's gonna be solved in one minor bump of the model version. But honestly, I don't think they're gonna bother, because it's just too silly and they won't expect their models to learn anything useful from it.

Especially since there are already models that can learn how to play 8-bit games.

It feels like ARC-AGI jumped the shark. But who knows, maybe people who train models for robots are going to take it in stride.

typs 4 hours ago||
My takeaway from playing a number of levels is that I am definitely not AGI
Xenoamorphous 2 hours ago|||
NGI - Natural General Intelligence
ACCount37 3 hours ago|||
Thank you for keeping the bar of "AGI" low. The machines appreciate your contribution.
dyauspitr 54 minutes ago|||
SGI - Sub General Intelligence or another more colloquial word commonly seen amongst users of wallstreetbets.
utopiah 2 hours ago||
Don't forget that this implies a form of examination you are not used to, namely:

- open book: you have access to nearly the whole Internet and resources beyond it, e.g. torrents of nearly all books, research papers, etc., including the history of all previous tests, even those similar to this one

- arguably no real time limit, as the work can be parallelized across many threads and accelerated ridiculously through caching

- no shame in submitting a very large number of wrong answers until you get the "right" one

... so I'm not saying it makes it "easy" but I can definitely say it's not the typical way I used to try to pass tests.

levmiseri 11 minutes ago||
For a loosely similar 'benchmark', I recently tried to test major LLMs on my coding game (models write code controlling their units in a 1v1 RTS) - https://yare.io/ai-arena
lukev 3 hours ago||
I'm not sure how this relates to AGI.

This measures the ability of a LLM to succeed in a certain class of games. Sure, that could be a valuable metric on how powerful (or even generally powerful) a LLM is.

Humans may or may not be good at the same class of games.

We know there exists a class of games (including most human games like checkers/chess/go) that computers (not LLMs!) already vastly outpace humans.

So the argument for whether a LLM is "AGI" or not should not be whether a LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that.)

Seems unlikely that this set of games is a definition meaningful for any practical, philosophical or business application?

piiritaja 1 hour ago||
It's to do with how the creators of ARC-AGI defined intelligence. Chollet has said he thinks intelligence is how well you can operate in situations you have not encountered before. ARC-AGI measures how well LLMs operate in those exact situations.
Keyframe 36 minutes ago||
To an extent, yes. Discovering interdependent variables and then, hopefully, modeling and navigating such a system. If that's the case, then this is a simplistic version of it. How long until tests involve playing a modern Zelda with quests and sidequests?
imiric 2 hours ago||
"AGI" is a marketing term, and benchmarks like this only serve to promote relative performance improvements of "AI" tools. It doesn't mean that performance in common tasks actually improves, let alone that achieving 100% in this benchmark means that we've reached "AGI".

So there is a business application, but no practical or philosophical one.

culi 2 hours ago||
The thing I most appreciate about the ARC-AGI leaderboards is how the graph also takes into account cost per task. All of the recent major advancements in benchmarks seem a little less impressive when also taking into account the massive rise in cost they're paired with. The fact is we can always get a little bit better output if we're willing to use more electricity
strongpigeon 2 hours ago||
This is a good and clever benchmark and a worthy successor to the previous two. That being said, I find that the "No tools" approach is a bit odd. They're basically saying that it's OK to have tools as long as they're hidden behind the API layer. Isn't this an odd line to draw?

It feels like it should be about having no ARC-AGI-3-specific tools, not "no non-built-in tools"...

Zedseayou 1 hour ago|
I was a human tester (I think) for this set of games. I did 25 games in the 90 minutes allotted. IIRC the instructions did mention minimizing action count, but the incentives/setup ($5 per game solved) pushed for solve speed over action count. I do recall trying not to just randomly move around while thinking, but that was not the primary goal, so I would expect that the baseline human solutions have more actions than might otherwise be needed.