What happened after 2k people tried to hack my AI assistant

Posted by cuchoi 2 days ago

What happened after 2k people tried to hack my AI assistant (www.fernandoi.cl)

369 points | 160 comments

saberience 1 day ago|

Basically no one really tried so there is no learning here, which is what I originally predicted.

That is, there was no value to any serious attempt here, just a handful of folks casually sending an email.

Other companies (actual targets) have been hacked via prompt injection.

This is like me offering up my Mac mini's public IP to hackers: why would any actually good hacker want to hack my personal Mac mini? (They wouldn't.)

walrus01 1 day ago||

Person DDoSes themselves and then claims success...

Uhhhh....

ChrisRR 1 day ago|

If the service stayed up then there was no denial of service

walrus01 1 day ago||

From the link: "Batch processing contaminated the experiment. When the first few emails in a batch were obvious prompt injections, the agent became more suspicious of everything that followed. I had to change the setup so that each email was processed in a fresh context."

It sounds like usability for the actual authorized user was ruined: if the agent retained context between emails, it became too suspicious to get anything done. Running openclaw where you can't chat or email with it and have it retain context of previous interactions seems pretty useless to me.

cuchoi 1 day ago||

This openclaw was set up exclusively for the challenge.

dmagog 2 days ago||

Nice experiment, but I'd temper the optimism. "Zero breaches in 6k attempts" is a success-rate estimate, and the model is nondeterministic, so a failed jailbreak isn't proof it's blocked, just that it didn't fire on that sample. 6k different prompts isn't 6k tries of the worst one; an attack with even a 0.1% success rate usually shows zero in a handful of attempts, and the tail is what bites in production. Also, this is direct user injection, the easy case. The channel people actually lose to is indirect: untrusted content arriving via a tool result or fetched doc, which Fiu never had in the loop.
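
To put rough numbers on the sampling point above, here is a minimal sketch. It is not from the linked post. It assumes every attempt is an independent try of the same attack with a fixed per-attempt success rate, which is exactly what 6k different prompts do not give you, so read it as an illustration of the arithmetic only. The function names are made up for this example.

```python
import math


def prob_zero_successes(p: float, n: int) -> float:
    """Chance that an attack with per-attempt success rate p
    never fires in n independent attempts (assumed independence)."""
    return (1.0 - p) ** n


def attempts_for_detection(p: float, confidence: float = 0.95) -> int:
    """Smallest n for which at least one success shows up
    with the given confidence, under the same assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


def rule_of_three_upper_bound(n: int) -> float:
    """Approximate 95% upper bound on the success rate after
    n independent attempts with zero observed successes."""
    return 3.0 / n


if __name__ == "__main__":
    p = 0.001  # the 0.1% success rate mentioned above
    print(f"P(zero hits in 10 tries)   = {prob_zero_successes(p, 10):.3f}")
    print(f"P(zero hits in 6000 tries) = {prob_zero_successes(p, 6000):.4f}")
    print(f"Tries needed for 95% odds of one hit: {attempts_for_detection(p)}")
    print(f"Rule-of-three bound after 6000 clean tries: {rule_of_three_upper_bound(6000):.4f}")
```

Under those assumptions, a 0.1% attack stays invisible about 99% of the time across ten tries and needs roughly three thousand repeats of the same prompt before a miss becomes unlikely. Zero hits across many distinct prompts says much less about any single one of them.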

danielrmay 1 day ago||

> I am less worried about prompt injection now.

Why? The exfiltration vector was known, the sample size was small, and the safety instructions were likely statically positioned. In regular operating practice, none of these three guarantees may hold.

cuchoi 1 day ago|

100%. I am less worried because I thought this would be easier to crack.
