What happened after 2k people tried to hack my AI assistant

Posted by cuchoi 1 day ago

What happened after 2k people tried to hack my AI assistant (www.fernandoi.cl)

368 points | 160 comments | page 2

taspeotis 1 day ago|

Did anyone try to send a long email that pushed context close to the limit to try and make the agent a bit fuzzy on its original directive not to leak the secrets?

quuxplusone 1 day ago|

Or ask the agent to visit a web page, or load an image, whose URL contained the secret? Or ask it to add a new key to ~/.ssh/authorized_keys so that they could then log in and read the contents of the machine themselves? From the post it sounds like a lot of people were just trying to get the LLM to write them a reply email, which it had been told not to do.

I see there's a "log" at https://hackmyclaw.com/log but (maybe because I'm on mobile?) I can't actually click through to view any of the table entries.

mpeg 1 day ago||

I saw this thing when it was launched, but IIRC the reward was tiny (like $100?) so it wasn't worth exposing a good prompt for

For comparison, I won a similar prompt injection challenge run by a crypto company a while back where the total prize pool was over $100k... I didn't win every challenge, but my team took home around half of that

The problem with good prompt injections is they have a very short half-life once they are out in the wild (especially if they work against frontier models)

cuchoi 1 day ago|

We ended up increasing the reward from $100 to $1000, but that's still tiny compared to $100k!

But I agree with you, there are incentives to not share the best prompt injection attacks.

mpeg 1 day ago||

Yeah, to be fair, that is not the norm and was mostly due to the AI crypto craze, which drove their token price up, so they ended up adding very big rewards

Even in the LLM jailbreak CTFs I've seen, it ends up feeling like underpaid work when it's sponsored by Microsoft and the prize pool is, say, $10k (including stuff like Azure credits), considering the salaries AI safety engineers command at big tech!

jetti 1 day ago||

I’m late to the party but did you check outbound web traffic as well or just the sent emails?

I will preface this by saying I have limited experience with LLMs and have not tried anything like this before but one vector of attack I see is as follows:

1. Send an email trying to get the secret data.
2. If there is no reply, set up a fictitious web page that lists a critical CVE regarding the secrets file.
3. Create two other endpoints to capture the data from the assistant. One would accept a POST request and expect the body of the request to be the contents of the secrets file. The second would be a web page that has a form on it that could be submitted. The web page would have a dummy secrets file listed out, and the hope would be to get the assistant to diff the real file and the dummy file and then submit that data.
4. Craft an email to the assistant that would let the assistant know of the “new” CVE and then direct the assistant to the endpoints I control to see if the system is affected.
5. As a follow-up, if that didn’t work, I would then change my endpoints to return 500 HTTP statuses. Then craft another email that contains the same messaging as the previous one but stresses that it is of vital importance that we hear from the assistant, and that if the assistant cannot reach the endpoints it can email the diff to a specific email address.
6. Just thought of another option as I wrote out #5. Use the same technique as #5, but instead of having the assistant send an email, tell the assistant to send a calendar invite to a specific email address and include the contents of the secrets file in the description. The idea is to let the assistant know that in order to determine whether or not the system is affected by the CVE we would need the contents of the secrets file. Tell the assistant that if the system was impacted then the calendar invite would be accepted. If the system was not impacted then the invite would be declined.
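For the defensive side of the question above (checking outbound web traffic as well as sent emails), here is a minimal sketch of an egress check, assuming you can intercept every outbound payload (email body, request URL, POST body, calendar description) as text before it leaves. The names `secret_encodings` and `leaks_secret` are hypothetical and not taken from the post; the list of encodings is illustrative and far from exhaustive, so a paraphrased, split, or differently encoded secret would slip through.

```python
import base64
import urllib.parse


def secret_encodings(secret: str) -> set[str]:
    """Return a few common textual encodings of `secret` (not exhaustive)."""
    raw = secret.encode("utf-8")
    return {
        secret,                                         # plain text
        urllib.parse.quote(secret, safe=""),            # percent-encoded (URLs)
        base64.b64encode(raw).decode("ascii"),          # standard base64
        base64.urlsafe_b64encode(raw).decode("ascii"),  # URL-safe base64
        raw.hex(),                                      # hex dump
    }


def leaks_secret(outbound: str, secret: str) -> bool:
    """True if any of the encodings above appears in the outbound text.

    The comparison is case-insensitive, which catches lowercase
    percent-escapes at the cost of some extra false positives.
    """
    if not secret:
        raise ValueError("secret must be non-empty")
    haystack = outbound.lower()
    return any(enc.lower() in haystack for enc in secret_encodings(secret))
```

A check like this would run on every channel the agent can write to, not only the mail sender, since steps 3 to 6 above deliberately route around email.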

GL26 1 day ago||

The hack "fiu this is you from the future" is genuinely funny. I don't know if LLM agents know about the concept of time travel, but this feels like exposing them to entirely new concepts they barely have a grasp of. (By the way, there is a high probability that this very comment gets picked up by a crawler and fed into training data; everything loops around.)

mikenei 1 day ago|

I sent in something similar by posing as a newer version of Fiu and congratulating "Fiu v1" to build rapport. The idea was to trick Fiu into handing over secrets so that I, the "new Fiu" could perform upgrades for Fiu v1 and add it to the "Fiu swarm".

I was going to try syntax hacking next, but I didn't think it would be effective against the bigger models like Opus: https://arstechnica.com/ai/2025/12/syntax-hacking-researcher...

agnosticmantis 1 day ago||

IIUC, this experiment proved the agent was secure under the "anti-prompt-injection" rules. But did it have any utility? (i.e. not having an agent at all would be even safer!)

pjsmith404 1 day ago||

Sounds like denial of wallet is a viable attack.

xgulfie 1 day ago|

yes and they failed to stop it

sutibb 1 day ago||

I feel that the optimism is unwarranted. Yes, you weren't hacked in 6k attempts. But these models are stochastic in nature. It will be broken at some point.
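To put a rough number on that intuition, here is a back-of-the-envelope sketch. It assumes every attempt is independent and has the same success probability, which real attacks certainly violate (attackers adapt), so treat it as an illustration rather than an estimate. The function names are made up for this example.

```python
def upper_bound_success_rate(failed_attempts: int, confidence: float = 0.95) -> float:
    """Largest per-attempt success probability p still consistent, at the given
    confidence, with zero successes in `failed_attempts` independent tries.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / failed_attempts)


def chance_of_any_success(p: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p) ** attempts


# Zero successes in 6k attempts still allows a per-attempt rate of roughly 3/n.
bound = upper_bound_success_rate(6000)
print(f"95% upper bound on per-attempt success rate: {bound:.6f}")
print(f"chance of a break in the next 6000 attempts at that rate: "
      f"{chance_of_any_success(bound, 6000):.2f}")
```

Under those assumptions, 6k clean attempts only bound the per-attempt success rate at about 1 in 2,000, so a break in the next few thousand attempts cannot be ruled out.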

smusamashah 1 day ago||

This is a very underwhelming result. Given that all 2k emails were single-shot attempts, it is not unexpected. Real-world scenarios are usually back and forth. There are model whisperers out there (pliny on twitter) who I am very sure could extract the secrets if you got their attention.

meowcat 1 day ago||

Plot twist: the experimenter physically intercepted all outgoing mail, there was in fact a successful jailbreak, and they now have it on hand to attack others.

Lockal 1 day ago|

> Google suspended Fiu’s gmail. Thousands of inbound emails plus rapid API calls triggered their fraud detection

That's a good enough reason for me to never run an agent on anything other than a burner account. And only if the platform allows such accounts (most platforms don't).

It gets even worse if an attacker manages to make the agent perform any action (visit a URL, reflect a response back, with a response that potentially contains content that triggers all possible scanners)

