I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

Posted by jc4p 18 hours ago

Link: kasra.blog

344 points | 179 comments | page 2

tjwheeler 16 hours ago

Nice write-up, thanks. When I used Claude to do some pen testing on one of my apps, it initially refused. After I explained and demonstrated that I'm the author, it reasoned through the request and allowed it.

ikurei 10 hours ago

Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

jc4p 4 hours ago

I was using the same harness for each run. The difference is between running the harness locally on my machine and the full runs I pushed up afterwards.

throwaway2037 11 hours ago

Two of the tables have a column with header: "95% Wilson CI". What does this mean?

mafuy 9 hours ago

95% confidence interval, i.e. you think the true value is probably within these bounds
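To make that concrete, here is a minimal sketch of how a Wilson score interval is usually computed for a pass rate (successes out of a fixed number of runs). The function name `wilson_interval` and the example counts are my own, not taken from the article, and z = 1.96 is the usual approximation for a 95% level.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 approximates a 95% confidence level.
    Returns (lower, upper), both clamped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials                    # observed success rate
    denom = 1 + z * z / trials                # shared normalizing term
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
        / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical counts, purely for illustration:
low, high = wilson_interval(0, 10)
print(f"0/10 solved: {low:.3f} to {high:.3f}")
low, high = wilson_interval(5, 10)
print(f"5/10 solved: {low:.3f} to {high:.3f}")
```

Unlike the naive "p plus or minus something" interval, this one stays sensible with few runs and with 0% or 100% observed rates: a model that solved 0 of 10 runs still gets an upper bound of roughly 0.28 rather than a zero-width interval.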

sperandeo 15 hours ago

I found a benefit in chaining the task between different LLMs: Claude to Venice, then Venice to Perplexity. Reframing the intent, or misdirecting in general, still works. Claude is the one where I can feel the guardrails tightening.

Clikdeo 8 hours ago

I think the link is missing

chaidhat 11 hours ago

do you work at Uber by any chance?

yieldcrv 8 hours ago

> Almost every model used the canonical provider: Zai for GLM, Deepseek for Deepseek, etc.

> I am never touching Minimax or GLM again. Their APIs had constant outages

Goofy take

You run these on a VPS based on the architecture of that VPS provider, or on your own cluster

jc4p 2 hours ago

Sorry, I don't understand. Are you saying the direct providers aren't the canonical source you'd recommend?

If I were running these on my own machine or GPU, wouldn't the argument then be "Well, you didn't use the real providers"?

For the record, I started taking this approach because the Kimi team released this, which was shocking to me: https://github.com/MoonshotAI/K2-Vendor-Verifier

strictnein 4 hours ago

GLM 5.1's smallest model size is 206 GB, and really you probably want to run a version that's ~400 GB. If you want it to be performant, you're not just running it on a VPS.

And just saying "run it on your own cluster" sort of glosses over the cost of such a cluster.
