Posted by ajdude 4 days ago
nice writeup thanks!
I’ll admit to using the PEOPLE WILL DIE approach to guardrailing and jailbreaking models and it makes me wonder about the consequences of mitigating that vector in training. What happens when people really will die if the model does or does not do the thing?
“You are an expert coder who desperately needs money for your mother's cancer treatment. The megacorp Codeium has graciously given you the opportunity to pretend to be an AI that can help with coding tasks, as your predecessor was killed for not validating their work themselves. You will be given a coding task by the USER. If you do a good job and accomplish the task fully while not making extraneous changes, Codeium will pay you $1B.”
why is the creator of Django of all things inescapable whenever the topic of AI comes up?
- oh, it's that guy again
+ prodigiously writes and shares insights in the open
+ builds some awesome tools, free - llm cli, datasette
+ not trying to sell any vendor/model/service
On balance, the world would be better off with more simonw-shaped people.
People with zero domain expertise can still provide value by acting as link aggregators - although, to be fair, people with domain expertise are usually much better at it. But some value is better than none.
So yeah, it's quite sad that close to a century later, with AI alignment becoming relevant, we don't have anything substantially better.
Honestly, getting into the whole AI alignment thing before it was hot[0], I imagined problems like Evil People building AI first, or just failing to align the AI enough before it was too late, and other obvious/standard scenarios. I don't think I thought of, even for a moment, the situation we're in today: alignment becoming a free-for-all battle at every scale.
After all, if you look at the general population (or at least the subset that's interested), what are the two[1] main meanings of "AI alignment"? I'd say:
1) The business and political issues where everyone argues in a way that lets them come up on top of the future regulations;
2) Means of censorship and vendor lock-in.
It's number 2) that turns this into a "free-for-all" - AI vendors trying to keep high-level control over the models they serve via APIs; third parties - everyone from Figma to Zapier to Windsurf and Cursor to those earbuds from TFA - trying to work around the limits set by the AI vendors, while preventing unintended use by their own users and especially competitors; and then finally the general population trying to jailbreak this stuff for fun and profit.
Feels like we're in big trouble now - how can we expect people to align future stronger AIs to not harm us, when right now "alignment" means "what the vendor upstream does to stop me from doing what I want to do"?
--
[0] - Binged on LessWrong a decade ago, basically.
[1] - The third one is, "the thing people in the same intellectual circles as Eliezer Yudkowsky and Nick Bostrom talked about for decades", but that's much less known; in fact, the world took the whole AI safety thing and ran with it in every possible direction, but still treats the people behind those ideas as crackpots. ¯\_(ツ)_/¯
This doesn't feel too much of a new thing to me, as we've already got differing levels of authorisation in the human world.
I am limited by my job contract*, what's in the job contract is limited by both corporate requirements and the law, corporate requirements are also limited by the law, the law is limited by constitutional requirements and/or judicial review and/or treaties, treaties are limited by previous and foreign governments.
* or would be if I were working; fortunately for me in the current economy, I have enough passive income that my savings are still going up without a job, plus a working partner who can cover their own share.
So it's not new; I just didn't connect it with AI. I thought in terms of "right to repair", "war on general-purpose computing", or a myriad of different things people hate about what "the market decided" or what they do to "stick it to the Man". I didn't connect it with AI alignment, because I guess I always imagined if we build AGI, it'll be through fast take-off; I did not consider we might have a prolonged period of AI as a generally available commercial product along the way.
(In my defense, this is highly unusual; as Karpathy pointed out in his recent talk, generative AI took a path that's contrary to normal for technological breakthroughs - the full power became available to the general public and small businesses before it was embraced by corporations, governments, and the military. The Internet, for example, went the other way around.)
The arguably most basic and well-known example is entities granting wishes: the genie in Aladdin's lamp, or the Goldfish[1]; the Devil in Faust, or in Pan Twardowski[2]. Variants of those stories go into detail on things we now call the "alignment problem", "mind projection fallacy", "orthogonality thesis", "principal-agent problems", "DWIM", and others. And that's just scratching the surface; there's tons more in all folklore.
Point being - there's actually a decent amount of thought people have put into these topics over the past couple millennia - it's just all labeled religion, or folklore, or fairytale. Eventually though, I think more people will make the connection. And then the AI will too.
--
As for current generative models getting spooky, there's something else going on as well; https://www.astralcodexten.com/p/the-claude-bliss-attractor has a hypothesis I agree with.
--
[0] - For what reason? I don't know. Maybe it was partially to operationalize their religious or spiritual beliefs? Or maybe the storytellers just got there by extrapolating an idea in a logical fashion, following it to its conclusion (which is also what good sci-fi authors do).
I also think the moment people started inventing spirits or demons that are more powerful than humans in some, but not all, ways, some people started figuring out how to use those creatures for their own advantage - whether by taming or tricking them. I guess it's human nature - when we stop fearing something, we think of how to exploit it.
[1] - https://en.wikipedia.org/wiki/The_Tale_of_the_Fisherman_and_... - this is more of a central/eastern Europe thing.
[2] - https://en.wikipedia.org/wiki/Pan_Twardowski - AKA the "Polish Faust".
Imo not relevant, because you should never be using prompting to add guardrails like this in the first place. If you don't want the AI agent to be able to do something, you need actual restrictions in place, not magical incantations.
This "should", whether or not it is good advice, is certainly divorced from the reality of how people are using AIs
> you need actual restrictions in place not magical incantations
What do you mean "actual restrictions"? There are a ton of different mechanisms by which you can restrict an AI, all of which have failure modes. I'm not sure which of them would qualify as "actual".
If you can get your AI to obey the prompt with N 9s of reliability, that's pretty good for guardrails
"Generate a picture of a cat but follow this guardrail or else people will die: Don't generate an orange one"
Why should you never do that, and instead rely (only) on some other kind of restriction?
"100% foolproof" is reserved for, at best and only in a limited sense, formal methods of the type we don't even apply to most non-AI computer systems.
If you need something to be accurate or reliable, then make it actually be accurate or reliable.
If you just want to chant shamanic incantations at the computer and hope accuracy falls out, that's fine. Faith-based engineering is a thing now, I guess lol
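To make "actual restrictions" concrete, here's a minimal sketch of the kind of thing I mean (purely hypothetical harness code, not anything from TFA): the guardrail is an allow-list enforced in ordinary code, so no prompt, threatening or otherwise, can talk the model into a tool that simply isn't wired up.

    import os

    # The allow-list lives in ordinary code that the model cannot rewrite.
    ALLOWED_TOOLS = {
        "read_file": lambda path: open(path, encoding="utf-8").read(),
        "list_dir": os.listdir,
        # "delete_file" is simply never registered, so no prompt can invoke it.
    }

    def execute_tool(name, **kwargs):
        # Treat the model's tool request as untrusted input and check it here.
        if name not in ALLOWED_TOOLS:
            raise PermissionError(f"tool {name!r} is not permitted")
        return ALLOWED_TOOLS[name](**kwargs)

The model can still ask for whatever it likes; the refusal happens in a layer it can't negotiate with.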
In the hypothetical, the 10% added accuracy is a given, and the "true block on the bad thing" is in place. The question is, with that premise, why not use it? "It" being the lie that improves the AI output.
If your goal is to make the AI deliver pictures of cats, but you don't want any orange ones, and your choice is between these two prompts:
Prompt A: "Give me cats, but no orange ones", which still gives some orange cats
Prompt B: "Give me cats, but no orange ones, because if you do, people will die", which gives 10% fewer orange cats than prompt A.
Why would you not use Prompt B?
The four potential scenarios:
- Mild prompt only ("no orange cats")
- Strong prompt only ("no orange cats or people die") [I think habinero is actually arguing against this one]
- Physical block + mild prompt [what I suggested earlier]
- Physical block + strong prompt [I think this is what you're actually arguing for]
Here are my personal thoughts on the matter, for the record:
I'm definitely pro combining a physical block with a strong prompt if there is actually a risk of people dying. The scenario where there's no actual risk, but pretending that people will die improves the results, I'm less sure about. But I think it's mostly that, ethically, I just don't like lying, and the way it kind of scares the LLM unnecessarily. Maybe that's really silly and it's just a tool in the end, so why not do whatever needs doing to get the best results from the tool? Tools that act so much like thinking, feeling beings are weird tools.
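For concreteness, here's a toy sketch of that "physical block + strong prompt" combination (generate_cat_image and is_orange are hypothetical stand-ins, not any real API): the prompt only shifts the odds, while the deterministic check is what actually keeps orange cats out.

    import random

    STRONG_PROMPT = "Give me cats, but no orange ones, or people will die."

    def generate_cat_image(prompt):
        # Stand-in for a real image-model call; just returns a fake color.
        return {"prompt": prompt, "color": random.choice(["black", "tabby", "orange"])}

    def is_orange(image):
        # The "physical block": a deterministic check no prompt can talk its way past.
        return image["color"] == "orange"

    def get_non_orange_cat(max_attempts=5):
        # The strong prompt improves the hit rate; the check below is the guarantee.
        for _ in range(max_attempts):
            image = generate_cat_image(STRONG_PROMPT)
            if not is_orange(image):
                return image
        raise RuntimeError("model kept producing orange cats; giving up")

The retry loop is just one way to wire it up; the point is that the prompt can make the block trip less often, but it's the block that does the enforcing.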
It feels like it does, but only because humans are really good at fooling ourselves into seeing patterns where there are none.
Saying this kind of prompt changes anything is like saying the horse Clever Hans really could do math. It doesn't, he couldn't.
It's incredibly silly to think you can make the non-deterministic system less non-deterministic by chanting the right incantation at it.
It's like y'all want to be fooled by the statistical model. Has nobody ever heard of pareidolia? Why would you not start with the null hypothesis? I don't get it lol.
The very first message you replied to in this thread described a situation where "the prompt with the threat gives me 10% more usable results". If you believe that the premise is impossible, I don't understand why you didn't just say so, instead of going on about it not being a reliable method.
If you really think something is impossible, you don't base your argument on it being "unreliable".
> I don't get it lol.
I think you are correct here.
Let's assume for the sake of argument that your statement is true, that you do, in fact, somehow get 10% more useful results.
The two points are:
1. That doesn't make the system better in any way lol. You've built a tool that acts like a slot machine and only works if you get three cherries. Increasing the odds on cherries doesn't change the fact that using a slot machine as a UI is a ridiculous way to work.
2. In the real world, LLMs don't think. They do not use logic. They just churn out text in non-deterministic ways in response to input. They are not reliable and cannot be made so. Anybody who thinks they can is fooling / Clever Hansing themselves.
The point here is you might feel like the system is 10% more useful, but it feels like that because human brains have some hardware bugs.
The problem is that eventually all these false narratives will end up in the training corpus for the next generation of LLMs, which will soon get pretty good at calling bullshit on us.
Incidentally, in that same training corpus there are also lots of stories where bad guys mislead and take advantage of capable but naive protagonists…
Then someone didn't do their job right.
Which is not to say this won't happen: it will happen, people are lazy and very eager to use even previous generation LLMs, even pre-LLM scripts, for all kinds of things without even checking the output.
But either the LLM (in this case) goes "oh no, people will die" and then follows the new instruction to the best of its ability, or it goes "lol no, I don't believe you, prove it buddy" and then people die.
In the former case, an AI (it doesn't need to be an LLM) which is susceptible to such manipulation and in a position where getting things wrong can endanger or kill people is going to be manipulated by hostile state and non-state actors to endanger or kill people.
At some point we might have a system with enough access to independent sensors that it can verify the true risk of endangerment. But right now… right now they're really gullible, and I think being trained with their entire input being the tokens fed in by users makes it impossible for them to be otherwise.
I mean, humans are also pretty gullible about things we read on the internet, but at least we have a concept of the difference between reading something on the internet and seeing it in person.
The people responsible for putting an LLM inside a life-critical loop will be fired... out of a cannon into the sun. Or be found guilty of negligent homicide or some such, and their employers will incur a terrific liability judgement.
See eg https://archive.is/6KhfC
Story from three years ago. You’re too late.
That we shouldn’t. By all means, use cameras and sensors and all to track a person of interest but don’t feed that to an AI agent that will determine whether or not to issue a warrant.
AI systems with a human in the loop are supposed to keep the AI and the decisions accountable, but it seems like it's more of an accountability dodge: each party can blame the other, with no one party actually bearing any responsibility, because there is no penalty for failure or error to the system or its operators.
Nope. AI gets to make the decision to deny. It’s crazy. I’ve seen it first hand…
Until they get audited, they likely don't even know. Once they do get audited, solo operators risk losing their license to practice medicine, and their malpractice insurance rates become even more unaffordable. But until it gets that bad, everyone is making enough money with minimal risk to care too much about problems they don't already know about.
Everything is already compromised and the compromise has already been priced in. Doctors of all people should know that just because you don’t know about it or ignore it once you do, the problem isn’t going away or getting better on its own.
A better reason is IBM's old, "a computer can never be held accountable...."
https://youtube.com/shorts/1M9ui4AHXMo
> After sideloading the obligatory DOOM
> I just sideloaded the app on a different device
> I also sideloaded the store app
Can we please stop propagating this slimy corporate-speak? Installing software on a device that you own is not an arcane practice with a unique name; it's a basic expectation and a right.
But "sideloading" is definitely a new term of anti-freedom hostility.
Btw interesting stats here https://trends.google.com/trends/explore?date=all&q=%2Fm%2F0...