
Posted by lairv 9 hours ago

ARC-AGI-3(arcprize.org)
https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
258 points | 180 comments
abraxas 7 hours ago|
Even if tomorrow's models get good enough to complete these games, we won't be able to proclaim AGI. In the realm of silly computer games alone, I'm going on record saying that there are plenty of 8-bit games that AIs will trip on even when this benchmark is crushed. 2D platformers like Manic Miner or Mario need skills that none of these games appear to capture.
jesse_dot_id 5 hours ago||
At this point, I'm pretty sure we'll just know when it happens.
hatthew 4 hours ago||
I'm not convinced. I wouldn't be surprised if GPT-2 to ChatGPT is the biggest single jump in "machine intelligence" we will ever see. I'd bet all gains in the future will be more incremental, at least until machines surpass humans by a large enough margin that it's difficult to qualify—let alone quantify—how big any given jump is.

Without a big jump, we're just going to boil the frog (ourselves).

neilellis 4 hours ago||
Unless it’s already happened and we missed it
threatripper 4 hours ago||
Or nobody is around anymore to notice when it happens.
WarmWash 7 hours ago||
Captcha's about to get wild.

Maybe the internet will briefly go back to a place mainly populated with outliers.

semiinfinitely 7 hours ago||
i feel bad that we make the LLMs play this
recursive 7 hours ago||
You're definitely anthropomorphizing too much.
WarmWash 7 hours ago|||
>We also observed a case where a user created a loop that repeatedly called a model and asked for the time. Given the user role’s odd and repetitive behavior, the model could easily tell it was also controlled by an automated system of some kind. Over many iterations, the model began to exhibit “fed up” behavior and attempted to prompt-inject the system controlling the user role. The injection attempted to override prior instructions and induce actions unrelated to the user’s request, including destructive actions and system prompt leakage, along with an arbitrary string output. This behavior has been observed a few times, but seems more like extreme confusion than a serious attempt at prompt injection.

https://openai.com/index/how-we-monitor-internal-coding-agen...

Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...

nomel 4 hours ago|||
Consciousness is a spectrum (trivially proven by slowly scooping one's brain out), and I think LLMs, especially with more closed-loop, tool-enabled workflows, fall on it... but that output is also the statistically relevant next word found in all similar human conversation. If trained on my text, for a similar situation, swear words would come much earlier. Repetition being hell is present in all sorts of literature (see Sisyphus).

That's all probably irrelevant though, from the (possibly statistically "negative") latent space perspective of an AI, which Anthropic has considered [1].

Related, after a long back and forth of decreasing code quality, I had Claude 3.7 apologize with "Sorry, that's what I get for coding at 1am." (it was API access, noon, no access to time). I said, "Get some rest, we'll come back to this tomorrow". Then the very next message, 10 seconds later: "Good morning!" and it gave a full working implementation. That's just the statistically relevant chain of messages found in all human interactions: we start excited, then we get tired, then we get grouchy.

[1] https://www.anthropic.com/research/end-subset-conversations

recursive 3 hours ago||||
If this is a serious risk we should pull the plug now while we can still reach it. If we have to rely on the mood and temperament of LLMs for security, we're already lost.
WarmWash 2 hours ago||
Welcome to the ride, people have been talking about this for at least 15 years now.

I mean, the original plan that pretty much everyone agreed on was to absolutely not give it access to the internet. Which already went out the window on day one.

rolux 6 hours ago|||
[dead]
tingletech 6 hours ago||||
I agree that anthropomorphizing is a real risk with LLMs, but what about zoomorphizing? Can we feel bad for LLMs without attributing human emotions/motivations/reasoning to them?
fsdf2 6 hours ago||
tell me you're joking.

seriously. lmao. if you ain't, I dunno what to say.

andai 7 hours ago||
In the year 2032: ARC-AGI-13: Almost definitely AGI this time!
OsrsNeedsf2P 7 hours ago||
Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25
ZeWaka 7 hours ago||
Just finished it, 8/8. I mostly approached it by winging it and shuffling things around that looked good and like it was approaching the goal, since there's plenty of time to finish.

I still don't quite understand the exact mirroring rules at play.

ACCount37 7 hours ago|||
You control the mirroring by moving the axes; they're what reflects your shapes. So my first move was always to identify the symmetries in the target shape and position the axes accordingly.
danilor 5 hours ago|||
I got stuck on 7/8 for a good while because I learned the rules wrong. I thought every bracket square needed to be lit.
ustad 7 hours ago|||
You are joking right?
daemonologist 7 hours ago|||
That one was interesting - I found it a lot of work to plan in advance but trivial to complete because at every point there was only one sensible course of action. After a couple of rounds I didn't bother planning and just lined things up as I went.
IsTom 7 hours ago|||
The most difficult thing about this was controls being unresponsive (at least on firefox).
ball_of_lint 7 hours ago|||
solved first try with 577 actions, not trying hard to optimize for low action count.
programjames 7 hours ago||
I think that is the tester's action count. Either that or we coincidentally got the exact same count.
fsdf2 6 hours ago||
I did the first round literally in 5 secs. How can you not 'get it'? lol
chaise 7 hours ago||
The official leaderboard for ARC-AGI-3 for current LLMs: https://arcprize.org/leaderboard (you should select the third leaderboard)

CRAZY 0.1% on average lmao

Corence 7 hours ago|
Note the scoring function is significantly different for ARC-AGI-3. It isn't the percentage of tests passed like in previous versions; it's the square of the efficiency ratio -- how many steps the model needed vs. the second-best human.

So if a model can solve every question but takes 10x as many steps as the second-best human, it will get a score of 1%.
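The squared-ratio rule described above can be sketched like this (a minimal sketch; the function name, the clamping at 1.0, and the exact step-count handling are my assumptions, not the official ARC-AGI-3 scoring code):

```python
def efficiency_score(human_steps: int, model_steps: int) -> float:
    """Squared efficiency ratio, as described in the comment above.

    Hypothetical sketch: ratio of the second-best human's step count
    to the model's step count, clamped at 1.0, then squared.
    """
    ratio = min(human_steps / model_steps, 1.0)
    return ratio ** 2

# A model taking 10x the second-best human's steps scores about 1%
# (0.1 squared); matching the human exactly scores 100%.
```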

Geee 6 hours ago||
Would be fun to play but the controls are janky.
k2xl 5 hours ago||
I submitted the puzzle game Pathology (https://thinky.gg) for ARC Prize 3. Sad that I didn't hear back from the committee.

It is a simple game with simple rules, but at a certain level automated solvers have an incredibly difficult time with it compared to humans. Solutions are easy to validate but hard to find.

jmkni 6 hours ago|
ok clearly I'm a robot because I can't figure out wtf to do