Hacker News with Generative AI: Evaluation Metrics

Task-specific LLM evals that do and don't work (eugeneyan.com)
If you’ve run off-the-shelf evals for your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, you could spend weeks and still not have evals that reliably measure how you’re doing on your tasks.
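
A minimal sketch of the kind of check this implies: compare an off-the-shelf eval's scores against human judgments of task success on the same outputs and see how strongly they correlate. The metric names and data below are assumptions for illustration only.

from scipy.stats import spearmanr

# Hypothetical per-example scores: an off-the-shelf eval metric vs. human
# judgments of task success on the same outputs.
eval_scores  = [0.82, 0.91, 0.55, 0.60, 0.78, 0.95, 0.40, 0.73]
human_scores = [1,    1,    0,    1,    0,    1,    0,    1]

# Rank correlation between the eval and human judgments; a low value
# suggests the eval barely tracks application-specific performance.
rho, p_value = spearmanr(eval_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")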
A statistical approach to model evaluations (anthropic.com)
Suppose an AI model outperforms another model on a benchmark of interest—testing its general knowledge, for example, or its ability to solve computer-coding questions. Is the difference in capabilities real, or could one model simply have gotten lucky in the choice of questions on the benchmark?
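
A minimal sketch of this kind of analysis, assuming paired per-question scores for two models on the same benchmark; the data below is made up for illustration.

import numpy as np

# Hypothetical per-question scores (1 = correct, 0 = incorrect) for two models
# answering the same benchmark questions.
scores_a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
scores_b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])

# Paired differences: because both models see the same questions,
# analysing per-question differences removes question-level variance.
diff = scores_a - scores_b
n = len(diff)

mean_diff = diff.mean()
# Standard error of the mean difference (Central Limit Theorem approximation).
se = diff.std(ddof=1) / np.sqrt(n)

# Approximate 95% confidence interval for the true accuracy gap.
ci_low, ci_high = mean_diff - 1.96 * se, mean_diff + 1.96 * se
print(f"accuracy gap = {mean_diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval contains 0, the observed gap could plausibly be luck
# in the choice of benchmark questions.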
AI leaderboards are no longer useful. It's time to switch to Pareto curves (aisnakeoil.com)