Hacker News with Generative AI: Evaluation Metrics

Task-specific LLM evals that do and don't work (eugeneyan.com)
If you’ve run off-the-shelf evals for your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, you could spend weeks and still not have evals that reliably measure how you’re doing on your tasks.
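
A minimal sketch of the kind of check this implies: compare an off-the-shelf eval's scores against human judgments of task success on the same outputs and see how strongly they correlate. The metric names and data below are assumptions for illustration only.

from scipy.stats import spearmanr

# Hypothetical per-example scores: an off-the-shelf eval metric vs. human
# judgments of task success on the same outputs.
eval_scores  = [0.82, 0.91, 0.55, 0.60, 0.78, 0.95, 0.40, 0.73]
human_scores = [1,    1,    0,    1,    0,    1,    0,    1]

# Rank correlation between the eval and human judgments; a low value
# suggests the eval barely tracks application-specific performance.
rho, p_value = spearmanr(eval_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")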
A statistical approach to model evaluations (anthropic.com)
Suppose an AI model outperforms another model on a benchmark of interest—testing its general knowledge, for example, or its ability to solve computer-coding questions. Is the difference in capabilities real, or could one model simply have gotten lucky in the choice of questions on the benchmark?
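
A minimal sketch of this kind of analysis, assuming paired per-question scores for two models on the same benchmark; the data below is made up for illustration.

import numpy as np

# Hypothetical per-question scores (1 = correct, 0 = incorrect) for two models
# answering the same benchmark questions.
scores_a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
scores_b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])

# Paired differences: because both models see the same questions,
# analysing per-question differences removes question-level variance.
diff = scores_a - scores_b
n = len(diff)

mean_diff = diff.mean()
# Standard error of the mean difference (Central Limit Theorem approximation).
se = diff.std(ddof=1) / np.sqrt(n)

# Approximate 95% confidence interval for the true accuracy gap.
ci_low, ci_high = mean_diff - 1.96 * se, mean_diff + 1.96 * se
print(f"accuracy gap = {mean_diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval contains 0, the observed gap could plausibly be luck
# in the choice of benchmark questions.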
AI leaderboards are no longer useful. It's time to switch to Pareto curves (aisnakeoil.com)