I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation.
Seems to me a bunch of people are hoping that AI can test AI, and it can, to some degree. But in the end AI cannot be held accountable for such testing: we can never know all the holes in its judgment, nor can we expect that patching one hole won't tear open others.
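To be concrete about what "AI testing AI" usually means in practice, it's an LLM-as-judge loop, roughly like the sketch below. The `call_model` helper, the grading prompt, and the PASS/FAIL scheme are all illustrative placeholders, not any particular tool's API:

```python
# Rough sketch of an "AI tests AI" check (LLM-as-judge).
# `call_model` is a hypothetical stand-in for whatever client you use;
# the prompt and pass/fail convention are illustrative only.

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a model of your choice, return its text reply."""
    raise NotImplementedError

def judge(question: str, answer: str) -> bool:
    """Ask a judge model for a pass/fail verdict on one answer."""
    verdict = call_model(
        "You are grading an answer for factual accuracy and relevance.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def run_eval(cases: list[dict]) -> float:
    """Run the judge over a set of cases and return the pass rate.

    The judge itself is fallible, which is the point above: its verdicts
    still need humans spot-checking them.
    """
    passed = sum(judge(c["question"], c["answer"]) for c in cases)
    return passed / len(cases)
```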
Even if it is folk theory (and I specifically don't think it is), you've got to start somewhere, and I've not seen better advice than Hamel's kicking about anywhere. His writing helped me get my start on my own evals some months ago, for sure.