Why eval startups fail (2025)

Posted by jxmorris12 2 days ago

Why eval startups fail (2025) (thomasliao.com)

109 points | 56 comments

GL26 1 day ago||

The problem with evals is that the information doesn't go stale fast enough to make you need the latest model performance benchmarks. Bloomberg succeeded because it sells info that expires within the hour.

jampekka 1 day ago||

I think there's gonna be (or perhaps already is) a huge demand for evaling individual systems. Many countries are starting to adopt some criteria for LLM usage for public use, and I doubt govs are gonna develop in-house knowhow for this. These will likely form some kinds of "independent auditor" models, as the system provider has too strong conflicts of interest.

It's probably not gonna be exactly glorious work, but designing expert eval settings and collecting and crunching the data for quality assurance and control is going to be needed.

0xWTF 1 day ago||

I can see where Goodhart's Law applies to psychology and economics, pretty much any man-made domain without IDLH (immediate-danger-to-life-and-health) outcomes. But I think it's going to be hard to Goodhart a lot of medical AI safety. Biology doesn't give a shit.

However, identifying the right metrics and having the necessary test sets will, at times, be challenging.

torginus 1 day ago||

Imo it's very simple - AI is a big function inverter. If you have a better cost function than frontier labs, as in, you are better at judging model output quality, then you can use that cost function to RL the next generation of models.

Therefore your knowledge is better used in training than letting users be slightly better at the token casino. Which is mentioned in this post as well, eval startup people either go to work at frontier labs or finetune startups.

dippogriff 1 day ago||

The current way benchmarks are done and are accepted by the community makes for really uninspired work. Until we're willing to break out of this rigid evaluation format prone to crazy overfitting and gaming, talent will move elsewhere. It is kind of a chicken and egg problem though.

999900000999 1 day ago||

I’m convinced the only way to make a startup work, with a few exceptions, is to give away your product or sell below cost.

For years upon years until you get bought out. Then it’s someone else’s problem. Or you IPO and bring in new management to figure out how to make money.

VCs don’t see 20x exits happening for eval companies, so they have trouble with the "losing money for years" step.

jdw64 1 day ago||

If you look at the history of software engineering, the ones that made the most money were usually not the companies that built the applications themselves, but the ones that built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Personally, I agree with the Goodhart problem, but isn't the reason Eval startups fail because they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'

So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The issue involves various semantic criticisms, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.

whinvik 1 day ago||

> made the most money

> built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Curious. Which company made money with testing frameworks?

jdw64 1 day ago||

I thought about mentioning Atlassian (Jira) and JetBrains, but come to think of it, they aren't really testing frameworks. They cover the entire development workflow overall. I guess I was thinking too short.

noelwelsh 1 day ago||

The "shovels for gold miners" analogy is generally a good one. It applies to Nvidia, for example. It doesn't generally apply to developers though. Developer tooling is notoriously difficult to monetize. Developers themselves are a shovel.

brandensilva 1 day ago||

Devs are hard to market and sell to, I've heard. It's likely because they can build a lot of the stuff out there themselves when pressed. They have the most app exposure, so they are opinionated. It's why most devs take the open source spoils while everyone else avoids GitHub in general. Although AI has made it easy to set up locally, many still don't see the value of controlling their software or AI agents fully like devs do.

david_shi 1 day ago||

> I believe eval startups can work when they're targeting safety benchmarks specifically.

Are there any examples of successful startups doing this?

Chu4eeno 16 hours ago|

In addition to naming one, I'd also be interested in whether they actually do rigorous work.

The safety research that tends to get headlines is often extremely misleading, usually with directed prompting, or unreported additions to the system prompt specifying model roleplay behavior.

PaulHoule 1 day ago|

Worked or tried to work for a few places that ended eval work in the 2010s for previous-gen systems. Most didn’t pay for it; thanks to all the ones that didn’t, I didn’t dare try selling it to the one that would have.
