Where the goblins came from

Posted by ilreb 9 hours ago

719 points | 407 commentspage 2

flancian 2 hours ago|

Wait, did I get this right that the answer after all the investigation that showed they had set up a goblin-reinforcing loop during fine tuning was... to ask it to not mention goblins so much in the system prompt?!

romaniitedomum 4 hours ago||

Can you imagine a knowledge worker from the 1950s, say a clerk or a marketer, being magically transported into our time and dropped into a meeting like a morning standup, where people talk about how they spent their time stopping the artificial intelligence from talking about goblins so much? Hell, even when I was an IT student back in the 90s, people from my parents' generation struggled to grasp what it was that I was doing. Now, the disconnect is so vast that the mind reels.

albert_e 9 hours ago||

If a tiny misconfiguration of reward system can cause such noticeable annoyance ...

What dangers lurk beneath the surface.

This is not funny.

andai 9 hours ago||

For every gremlin spotted, many remain unseen...

TychoCelchuuu 8 hours ago||

This is a worry that people have been talking about in various forms for a while now, and I think it's a gigantic one. The only reason this was caught is that the quirk was a very noticeable verbal one. When words like "goblin" and "gremlin" pop up it is easy for us to spot. If the quirk takes another shape (say, ranking certain people with certain features as less trustworthy) it might be too subtle or too weird for us to notice it. Would I ever notice if ChatGPT consistently rates people born in June to be untrustworthy?

Here is an academic paper discussing this kind of worry: https://link.springer.com/article/10.1007/s11023-022-09605-x

Tenoke 5 hours ago||

A great example of how current alignment is imperfect and bound to miss random behaviors nobody is trying to get.

This is cute now, and a huge problem when future AI does everything and is responsible for problems it isn't even directly optimized for. Who knows what quirks would arise then.

m0rde 1 hour ago||

New technology isn't perfect now -> drop technology and never use it in the future

InfiniteRand 4 hours ago|||

I think eventually you are going to end up with every smart AI continually checked by dumber AI's to make sure they don't do anything too crazy. Which probably does bring AI closer to how human intelligence works

weitendorf 4 hours ago||

Completely agree, top down “alignment” and RLHF is actually quite primitive and uses a lot fancy words to describe what is essentially just hitting the machine with a stick without the nuance, context, or feedback to help it model why the feedback was given.

Also to be honest I think OpenAI models struggle a lot with this, I primarily stopped using them in the sycophancy/emoji era but ever since the way they talk or passive aggressively offer to do something with buzzwords just pisses me off so much. Like I’m constantly being negged by a robot because some SFT optimized for that really strongly to the point it can’t even hold a coherent conversation and this is called “AI safety” when it’s just haphazard data labeling

59nadir 3 hours ago||

I really liked this write-up; this is the type of LLM content that I actually want to read from these people, where they give a window into their world of putting together this odd artifact and we can empathize.

canpan 9 hours ago||

I wondered how is training data balanced? If you put in to much Wikipedia, and your model sounds like a walking encyclopedia?

After doing the Karpathy tutorials I tried to train my AI on tiny stories dataset. Soon I noticed that my AI was always using the same name for its stories characters. The dataset contains that name consistently often.

maxall4 9 hours ago|

At this scale, that kind of thing is not really a problem; you just dump all of the data you can find into the model (pre-training)1. Of course, the pre-training data influences the model, but the reinforcement learning is really what determines the model’s writing style and, in general, how it “thinks” (post-training).

1 This data is still heavily filtered/cleaned

upbeat_general 5 hours ago||

This isn’t quite accurate. Data weighting is quite important in pretraining.

2dvisio 7 hours ago||

I’ve been having consistent issues with it adding Hindi words (just one usually) in the middle of its output. And sounds like other have been having this too, https://news.ycombinator.com/item?id=47832912 I don’t speak Hindi, have never asked it to translate anything in Hindi.

dtech 6 hours ago||

I wonder if a proportionally large amount of RLHF was done by Indians which causes this behavior.

djyde 2 hours ago||

My Claude often starts sleep-talking in Korean suddenly.

SomewhatLikely 7 hours ago||

Checking my history I searched ["chaos goblin" chatgpt] on March 6th after seeing too many goblins and gremlins and didn't find anyone talking about it then. I did have the nerdy personality turned on and in my testing of Chatgpt 5.5 I did notice the nerdy personality was gone because some responses were not considering as many plausible interpretations or covering as many useful answers as the response recorded for 5.4. Rather than having the LLM guess the most plausible interpretation and focus on the most likely answer I prefer a more well-rounded response and if I want less I'll scan. Anyway, after seeing the personality was gone I just added a custom instruction to take on a nerdy persona and got back my desired behavior. But also the gremlins and goblins are back so I don't think their mitigation is strong enough to overcome the personality tuning.

pants2 8 hours ago||

Nice, OpenAI mentioned my HackerNews post in their article :) I appreciate that they wrote a whole blog post to explain!

https://news.ycombinator.com/item?id=47319285

iterateoften 9 hours ago|

This is funny because it’s a silly topic, but I think it shows something extremely seriously wrong with llms.

The goblins stand out because it’s obvious. Think of all the other crazy biases latent in every interaction that we don’t notice because it’s not as obvious.

Absolutely terrifying that OpenAI is just tossing around that such subtle training biases were hard enough to contain it had to be added to system prompt.

ninjagoo 9 hours ago||

> Absolutely terrifying that OpenAI is just tossing around that such subtle training biases were hard enough to contain it had to be added to system prompt.

May I introduce you to homo sapiens, a species so vulnerable to such subtle (or otherwise) biases (and affiliations) that they had to develop elaborate and documented justice systems to contain the fallouts? :)

chongli 9 hours ago|||

We’re really not that vulnerable to such things as a species, because we as individuals all have our own minds and our own sets of biases that cancel out and get lost in the noise. If we all had the exact same bias then it would be a huge problem.

arglebarnacle 8 hours ago|||

I hear you but of course history is full of examples of biases shared across large groups of people resulting in huge human costs.

The analogy isn’t perfect of course but the way humans learn about their world is full of opportunities to introduce and sustain these large correlated biases—social pressure, tradition, parenting, education standardization. And not all of them are bad of course, but some are and many others are at least as weird as stray references to goblins and creatures

Ekaros 4 hours ago||||

Doesn't that depend on the biases in question? Many argue that homogenous societies do many things better. And part of homogeneity is sharing same set of biases.

lifis 3 hours ago||||

And what do you think society/culture is?

It's a set of biases installed in people, whose purpose is mostly to replicate themselves.

Humans are MORE susceptible that LLMs, because LLMs's biases are easily steered to something else, unlike most humans.

ninjagoo 9 hours ago||||

> If we all had the exact same bias then it would be a huge problem.

And may I introduce you to "groupthink" :))

Dylan16807 8 hours ago||

Now imagine that every opinion you have is automatically fully groupthinked and you see the difference/problem with training up a big AI model that has a hundred million users.

The problem does exist when using individual humans but in a much smaller form.

ninjagoo 8 hours ago||

> The problem does exist when using individual humans but in a much smaller form.

And may I introduce you to organized religion :)

Dylan16807 8 hours ago||

That's still a lot smaller!

Make a major religion where everyone is a scifi clone of one person including their memories and then it'll be in the same ballpark of spreading bias.

jychang 8 hours ago|||

> We’re really not that vulnerable to such things as a species, because we as individuals all have our own minds and our own sets of biases that cancel out and get lost in the noise.

[Citation Needed]

Just because if you have a species-wide bias, people within the species would not easily recognize it. You can't claim with a straight face that "we're really not that vulnerable to such things".

For example, I think it's pretty clear that all humans are vulnerable to phone addiction, especially kids.

hbs18 2 hours ago|||

An LLM is a computer program, which isn't a human. You wouldn't excuse a calculator being occasionally wrong because humans sometimes get manual calculations wrong too.

snakebiteagain 8 hours ago|||

Mandatory reading on that topic: www.anthropic.com/research/small-samples-poison

We're probably not noticing a LOT of malicious attempts at poisoning major AI's only because we don't know what keywords to ask (but the scammers do and will abuse it).

tptacek 8 hours ago|||

I think it's extraordinarily telling that people are capable of being reflexively pessimistic in response to the goblin plague. It's like something Zitron would do.

This story is wonderful.

bitexploder 8 hours ago||

I feel at least partially responsible. I would often instruct agents to "stop being a goblin". I really enjoyed this story too, though.

bitexploder 8 hours ago|||

We do not have the complete picture.

ordinarily 9 hours ago||

Doesn't seem that surprising or terrifying to me. Humans come equipped with a lot more internal biases (learned in a fairly similar fashion), and they're usually a lot more resistant to getting rid of them.

The truly terrifying stuff never makes it out of the RLHF NDAs.

Terr_ 9 hours ago|||

We ought to be terrified, when one adjusts for ll the use-cases people are talking about using these algorithms in. (Even if they ultimately back off, it's a lot of frothy bubble opportunity cost.)

There a great many things people do which are not acceptable in our machines.

Ex: I would not be comfortable flying on any airplane where the autopilot "just zones-out sometimes", even though it's a dysfunction also seen in people.

famouswaffles 8 hours ago||

>Ex: I would not be comfortable flying on any airplane where the autopilot "just zones-out sometimes", even though it's a dysfunction also seen in people.

You might if that was the best auto-pilot could be. Have you never used a bus or taken a taxi ?

The vast majority of things people are using LLMs for isn't stuff deterministic logic machines did great at, but stuff those same machines did poorly at or straight up stuff previously relegated to the domains of humans only.

If your competition also "just zones out sometimes" then it's not something you're going to focus on.

agnishom 9 hours ago|||

Humans also take a lot of time in producing output, and do not feed into a crazy accelerationistic feedback loop (most of the time).

More comments...