I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation.
Seems to me a bunch of people are hoping that AI can test AI, and it can, to some degree. But in the end AI cannot be held accountable for such testing: we can never know all the holes in its judgment, nor can we expect that patching one hole won't tear open others.
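To be concrete about what "AI testing AI" usually means in practice, it's an LLM-as-judge loop, roughly like the sketch below. The `call_model` helper, the grading prompt, and the PASS/FAIL scheme are all illustrative placeholders, not any particular tool's API:

```python
# Rough sketch of an "AI tests AI" check (LLM-as-judge).
# `call_model` is a hypothetical stand-in for whatever client you use;
# the prompt and pass/fail convention are illustrative only.

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a model of your choice, return its text reply."""
    raise NotImplementedError

def judge(question: str, answer: str) -> bool:
    """Ask a judge model for a pass/fail verdict on one answer."""
    verdict = call_model(
        "You are grading an answer for factual accuracy and relevance.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def run_eval(cases: list[dict]) -> float:
    """Run the judge over a set of cases and return the pass rate.

    The judge itself is fallible, which is the point above: its verdicts
    still need humans spot-checking them.
    """
    passed = sum(judge(c["question"], c["answer"]) for c in cases)
    return passed / len(cases)
```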
Even if it is folk theory (and I specifically don't think it is), you've got to start somewhere, and I've not seen better advice than Hamel's kicking about anywhere. His writing helped me get my start on my own evals some months ago, for sure.