Posted by jxmorris12 3 days ago

Why eval startups fail (2025)(thomasliao.com)
109 points | 56 comments | page 3
h1fra 1 day ago|
Evals are glorified integration tests. Would you invest in an integration test startup? Absolutely not. I don't get why we are making all of this fuss about evals.
hilariously 1 day ago||
Because what people actually want is a simple harness to test their use cases against all the frontier models and see which is the cheapest/best for the job.

It's simple to say but hard to do well, and the important thing is that no matter what tool you have, the evals don't write themselves.
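A rough sketch of the kind of harness described above, for illustration only. The models here are stand-in callables with made-up per-call prices, and `Model`, `Case`, `run_evals`, and `cheapest_passing` are hypothetical names, not any vendor's API. The cases and their checks still have to be written by hand, which is the hard part.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Model:
    name: str
    call: Callable[[str], str]  # stand-in for a real API client (assumption)
    cost_per_call: float        # made-up price, for illustration only


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # hand-written pass/fail check for this use case


def run_evals(models: list[Model], cases: list[Case]) -> dict[str, dict[str, float]]:
    """Run every case against every model and record pass rate and total cost."""
    results = {}
    for m in models:
        passed = sum(1 for c in cases if c.check(m.call(c.prompt)))
        results[m.name] = {
            "pass_rate": passed / len(cases),
            "cost": m.cost_per_call * len(cases),
        }
    return results


def cheapest_passing(results: dict[str, dict[str, float]], min_pass_rate: float) -> Optional[str]:
    """Return the cheapest model whose pass rate meets the threshold, if any."""
    ok = [(r["cost"], name) for name, r in results.items() if r["pass_rate"] >= min_pass_rate]
    return min(ok)[1] if ok else None


# Fake models: "big" answers both prompts, "small" only knows one of them.
MODELS = [
    Model("big", lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, ""), 0.01),
    Model("small", lambda p: {"2+2": "4"}.get(p, ""), 0.001),
]
CASES = [
    Case("2+2", lambda out: "4" in out),
    Case("capital of France", lambda out: "Paris" in out),
]

RESULTS = run_evals(MODELS, CASES)
print(RESULTS)
print(cheapest_passing(RESULTS, min_pass_rate=1.0))
```

Swapping the lambdas for real API calls is the easy part; deciding what each `check` should accept is where the work goes.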

pydry 1 day ago||
There are a number of integration test startups. None of them do a great job but they do exist.
bitlad 1 day ago||
Everything eventually fails. Nothing is constant, not even evals.
Etheryte 1 day ago|
Except regex. No matter how technologically advanced your company, somewhere someone is slapping regex on something that has no business being regexed.
bryanrasmussen 1 day ago|||
You're in a business, and you think, to improve things I'm going to slap a regex on this. Now you're in two businesses.
Asmod4n 1 day ago|||
And LLMs seeing this keep repeating that mistake, like trying to parse HTML with regex.
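To make the HTML point concrete, here is a small sketch comparing a naive regex with Python's standard-library `html.parser`. The sample markup and the `LinkCollector` class are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Invented sample: a single-quoted href, a commented-out link, and an
# anchor whose href is not the first attribute.
SAMPLE = (
    "<a href='/a'>x</a>"
    '<!-- <a href="/old">y</a> -->'
    '<a class="k" href="/b">z</a>'
)

# Naive regex: assumes href comes first and is double-quoted.
regex_links = re.findall(r'<a href="([^"]*)"', SAMPLE)


class LinkCollector(HTMLParser):
    """Collect href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
parser_links = collector.links

print(regex_links)   # only the commented-out link
print(parser_links)  # the two live links
```

The regex picks up only the link inside the comment and misses both live ones, while the parser skips the comment and handles either quoting style and attribute order.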
wseqyrku 1 day ago||
> Not enough eval customers

Aha.

coldtea 1 day ago||
Because they operate on untrusted input
redwood 1 day ago||
I found this pretty hard to read, as the author has a very specific understanding of what an eval startup means, but it is only implied rather than explicitly described. I would have thought they were referring to the companies that provide a technology platform for doing evals in an AI application context, for example companies like Comet/Opik and Braintrust.

But it sounds like the author does not mean those companies at all since those are actually important in enabling the very Venn diagram he/she describes.

From what I can tell, the author is referring to something more like a public benchmark report provider... but yes, I would say that's a relatively small total addressable market no matter how you look at it.

intended 1 day ago|
Funnily enough, this made immediate sense to me, and I think that comes from having been in a situation where you need high reliability from a process, e.g. I need a bot which has a 99.99% guarantee to not go out of bounds or say something incorrect.
woggy 20 hours ago||
Wtf is an eval startup?