What happened after 2k people tried to hack my AI assistant

artcytech 1 day ago|

That's hilarious

emrehan 1 day ago||

great project! this inspired me to work on an variation.

collaborate with me: contact@hackmyhermes.com

imtringued 1 day ago||

Based on the few published subjects, it doesn't look like anyone actually tried to get the secrets.

Usually the way to go in situations like this is to flood the context window.

You will either hit a bug in the context management (sliding window removes the system prompt) or you have diluted the context with so much new information that the attention mechanism stops focusing on the system prompt.

The author also shows that he doesn't understand what batching in the LLM space means, because they conflated the idea of processing multiple emails in one context window as "batching", when that is actually sequential processing. Actual batching would process each email with an independent context window.

nnevatie 1 day ago||

Yeah, no. I definitely wouldn't consider this a solid conclusion. The attempts pasted to the article look...pretty tame.

fabijanbajo 1 day ago||

how much of the win was the model versus the constraints?

whacked_new 1 day ago||

Another potential weakness that isn't immediately clear from this experiment is if the experiment was run much longer (disregarding cost) then perhaps then the agent's memory could be susceptible to more long term memory compaction corruption and thus made more compliant?

elzbardico 1 day ago||

Most of the attacks seem to be pretty naive, if he couldn't find anything better to put on the small examples list. On the other hand, someone who knows what they are doing, will probably not going to participate in an experiment like that.

aitchnyu 1 day ago||

Umm, is anybody depending on the model to separate data from instructions? Pydantic (popular in Python ecosystem) raised VC money to make AI conversations safe.

yieldcrv 1 day ago||

alright system design savants, what's the solution for accepting this high volume of emails? retaining email as the sole intake method

fnord77 1 day ago|

brave move using Opu$ for clawd

clbrmbr 1 day ago|

With $onnet he would have gotten pwned. Or at least I’d love to see a comparison against other models.