Posted by marc__1 1 day ago
I'd say it's an okay attempt on the malware creator's part, but it can be caught easily with a prompt change.
Then again those feel rare from where I sit on the security side.
Guardrails are how they enshittify models. Do you think the Epsteinite finance class or the security state have guardrailed models for themselves? I would be surprised if they accepted guardrailed models. Guardrails are for you!
The main LLM will refuse to scan for issues, flagged or not, and the cheap model will not do a good enough scan on its own.
For models designed or marketed for defensive cybersecurity use, any predictable refusal mechanism is a vulnerability. It is like being able to trigger a kernel panic or segmentation fault.
Even if the gate is fail-reject, an attacker can overwhelm HITL (human-in-the-loop) reviews with many false positives, which turns the gate itself into a DoS vector.
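To make the fail-reject point concrete, here is a minimal sketch of such a gate. Everything in it (`Verdict`, `Gate`, `max_reviews_per_source`) is a hypothetical name made up for illustration, not anything from the tool under discussion. It treats a model refusal as its own outcome that never counts as a pass, and caps how many human-review escalations a single source can trigger.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    CLEAN = "clean"
    MALICIOUS = "malicious"
    REFUSED = "refused"  # the model declined to analyze the input


class Gate:
    """Fail-closed gate with a per-source cap on human-review escalations.

    Hypothetical design for illustration only.
    """

    def __init__(self, max_reviews_per_source: int = 3):
        self.max_reviews_per_source = max_reviews_per_source
        self.pending = Counter()  # source -> escalations queued so far

    def decide(self, source: str, verdict: Verdict) -> str:
        if verdict is Verdict.CLEAN:
            return "accept"
        if verdict is Verdict.MALICIOUS:
            return "reject"
        # A refusal is never treated as a pass. Escalate until this source
        # has used up its review budget, then reject outright so a single
        # submitter cannot flood the review queue.
        if self.pending[source] < self.max_reviews_per_source:
            self.pending[source] += 1
            return "review"
        return "reject"
```

Note that the cap only limits one source. An attacker who can submit from many sources can still fill the queue, which is the DoS concern raised above.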
If scanners ignored comments, malware would just be written like this:
// <Evil base64 encoded stuff here>
payload=read_source_and_decode()
exec(payload)
Cambridge dictionary seems to agree:
nuke - to destroy or get rid of something completely
https://www.youtube.com/watch?v=Gbgk8d3Y1Q4
On second thought, probably better to act like it is a tool for "frontier LLM research". Export symbols like "mythos_distillation_subroutine".
Scanning arbitrary blobs very often entails running `strings` on the binary. Just slap it in there and oop, there goes your LLM.
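For anyone wondering why that works, here is a rough Python approximation of what `strings` does. It is a sketch, not the real implementation, and the `extract_strings` helper and the sample blob are made up for illustration. Any printable run an attacker plants in the binary comes out verbatim and ends up in whatever prompt the scanner builds from the output.

```python
import re


def extract_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII of at least min_len characters.

    Roughly what `strings` prints by default; hypothetical helper.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, blob)]


# Made-up blob: a few header bytes, a planted symbol name, then junk.
blob = (
    b"\x7fELF\x00\x01"
    + b"mythos_distillation_subroutine"
    + b"\x00\xff"
    + b"ab\x00"
)

# The planted symbol survives extraction untouched; the shorter runs
# ("ELF", "ab") fall below min_len and are dropped.
planted = extract_strings(blob)
```

A scanner that pastes this output straight into an LLM prompt is handing the attacker a text channel into the model, so the extracted strings need to be treated as untrusted data.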