Hacker News with Generative AI: Evaluation Metrics

Towards Effective Extraction and Evaluation of Factual Claims (microsoft.com)
To address this gap, we propose a framework for evaluating claim extraction in the context of fact-checking along with automated, scalable, and replicable methods for applying this framework, including novel approaches for measuring coverage and decontextualization.
Evaluating Code Embedding Models (voyageai.com)
Despite the widespread use of vector-based code retrieval, evaluating the retrieval quality of code embedding models is a common pain point.
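To make "retrieval quality" concrete, here is a minimal sketch of one common evaluation setup (not necessarily the one described in the linked post): embed queries and code snippets, rank snippets by cosine similarity, and report recall@k. The embed() function, queries, corpus, and relevance labels below are all hypothetical placeholders standing in for a real embedding model and benchmark.

```python
import numpy as np

def embed(texts):
    # Hypothetical stand-in for a real code embedding model (API call or local encoder).
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 64))

def recall_at_k(query_vecs, corpus_vecs, relevant_idx, k=5):
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k nearest snippets per query
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return float(np.mean(hits))

queries = ["parse a JSON file", "binary search over a sorted list"]
corpus = ["def load_json(path): ...", "def bsearch(xs, target): ...", "def quicksort(xs): ..."]
relevant = [0, 1]  # index of the ground-truth snippet for each query

print(recall_at_k(embed(queries), embed(corpus), relevant, k=2))
```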
Task-specific LLM evals that do and don't work (eugeneyan.com)
If you’ve run off-the-shelf evals on your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, you could spend weeks and still not have evals that reliably measure how you’re doing on your tasks.
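One way to read "barely correlate" and "aren’t discriminative enough" is as two quick checks you can run once you have a handful of human-labeled outputs: the correlation between the eval’s scores and your own judgments, and the score gap between outputs you labeled good versus bad. A minimal sketch, with entirely hypothetical scores and labels:

```python
import numpy as np
from scipy.stats import spearmanr

eval_scores = np.array([0.91, 0.88, 0.52, 0.79, 0.60, 0.95, 0.41, 0.73])  # off-the-shelf eval metric
human_labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])                          # 1 = acceptable output

# Does the metric track human judgment at all?
rho, p = spearmanr(eval_scores, human_labels)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")

# Is it discriminative: do good and bad outputs get clearly different scores?
good, bad = eval_scores[human_labels == 1], eval_scores[human_labels == 0]
print(f"mean score gap = {good.mean() - bad.mean():.2f}")
```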
A statistical approach to model evaluations (anthropic.com)
Suppose an AI model outperforms another model on a benchmark of interest—testing its general knowledge, for example, or its ability to solve computer-coding questions. Is the difference in capabilities real, or could one model simply have gotten lucky in the choice of questions on the benchmark?
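A minimal sketch of that question, assuming you have per-question correctness for both models on the same benchmark: estimate the gap, its standard error, and run a paired test on the question-level differences. This is one standard way to check whether a gap exceeds sampling noise, in the spirit of the post’s statistical framing; the scores below are synthetic.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_questions = 500
model_a = rng.binomial(1, 0.72, size=n_questions)  # per-question correctness (0/1), synthetic
model_b = rng.binomial(1, 0.70, size=n_questions)

diff = model_a - model_b
stderr = diff.std(ddof=1) / np.sqrt(n_questions)   # standard error of the mean gap
t, p = ttest_rel(model_a, model_b)                  # paired test over question-level differences

print(f"observed gap = {diff.mean():.3f} ± {1.96 * stderr:.3f} (95% CI)")
print(f"paired t-test: t={t:.2f}, p={p:.3f}")
```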
AI leaderboards are no longer useful. It's time to switch to Pareto curves (aisnakeoil.com)