Posted by jxmorris12 2 days ago
It's probably not gonna be exactly glorious work, but designing expert eval settings, and collecting and crunching the data for quality assurance and control, is going to be needed.
However, identifying the right metrics and having the necessary test sets will, at times, be challenging.
Therefore your knowledge is better used in training than in letting users be slightly better at the token casino. This is mentioned in the post as well: eval startup people end up working at either frontier labs or fine-tuning startups.
For years upon years, until you get bought out. Then it’s someone else’s problem. Or you IPO and bring in new management to figure out how to make money.
VCs don’t see 20x exits happening for eval companies, so they have trouble with the "losing money for years" step.
Personally, I agree with the Goodhart problem, but isn't the reason eval startups fail that they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'
So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The real issues are semantic, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.
> built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.
Curious. Which company made money with testing frameworks?
Are there any examples of successful startups doing this?
The safety research that tends to get headlines is often extremely misleading, usually relying on directed prompting or on unreported additions to the system prompt that specify model roleplay behavior.