Hacker News with Generative AI: Evaluation

Evaluating AI Agents with Azure AI Evaluation (microsoft.com)
Artificial intelligence agents are rapidly evolving from simple chatbots to agentic AI systems capable of planning, tool use, and autonomous decision-making. With this increased sophistication comes a pressing need for equally sophisticated evaluation methods. How do we measure whether an AI agent is doing the right thing, using its tools correctly, and staying on task?
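Frameworks like the Azure AI Evaluation SDK ship purpose-built evaluators for these questions. As a rough illustration of one underlying idea only (not the SDK's API), a tool-call accuracy check can compare the calls an agent actually made against the calls expected for the task; all names below are hypothetical:

```python
# Minimal sketch (not the Azure AI Evaluation API): score whether an agent's
# tool calls match the calls expected for a task. Class, function, and field
# names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str              # which tool the agent invoked
    arguments: frozenset   # normalized (key, value) pairs passed to the tool

def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected tool calls the agent actually made (order-insensitive)."""
    if not expected:
        return 1.0
    remaining = list(actual)
    hits = 0
    for call in expected:
        if call in remaining:
            remaining.remove(call)
            hits += 1
    return hits / len(expected)

# Example: the agent made the expected search call plus one extra call.
expected = [ToolCall("search_flights", frozenset({("destination", "OSL")}))]
actual = [ToolCall("search_flights", frozenset({("destination", "OSL")})),
          ToolCall("get_weather", frozenset({("city", "Oslo")}))]
print(tool_call_accuracy(expected, actual))  # 1.0 - every expected call was made
```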
Show HN: Pi Co-pilot – Evaluation of AI apps made easy (withpi.ai)
Experimentation Matters: Why Nuenki isn't using pairwise evaluations (nuenki.app)
Nuenki's old language-translation quality benchmark used a simple system in which a suite of LLMs scored the outputs of other LLMs on a scale of 1 to 10.
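A minimal sketch of that kind of judge loop, assuming an OpenAI-compatible chat API; the judge models, prompt, and score parsing are illustrative placeholders, not Nuenki's code:

```python
# Sketch of a simple LLM-as-judge scoring loop: several judge models each rate
# a translation from 1 to 10 and the scores are averaged.
import re
from statistics import mean
from openai import OpenAI

client = OpenAI()
JUDGES = ["gpt-4o-mini", "gpt-4o"]  # hypothetical judge suite

def judge_translation(source: str, translation: str, target_lang: str) -> float:
    """Average 1-10 quality score across a suite of judge models."""
    prompt = (
        f"Rate the following {target_lang} translation of the source text "
        f"on a scale of 1 (unusable) to 10 (flawless). Reply with only the number.\n"
        f"Source: {source}\nTranslation: {translation}"
    )
    scores = []
    for model in JUDGES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        match = re.search(r"\d+(\.\d+)?", reply)
        if match:
            # Clamp to the 1-10 range in case the judge rambles.
            scores.append(min(10.0, max(1.0, float(match.group()))))
    return mean(scores) if scores else float("nan")
```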
Strengthening AI Agent Hijacking Evaluations (nist.gov)
Large AI models are increasingly used to power agentic systems, or “agents,” which can automate complex tasks on behalf of users.
Chatbots Are Cheating on Their Benchmark Tests (theatlantic.com)
AI programs train on questions they’re later tested on. So how do we know if they’re getting smarter?
Can We Trust AI Benchmarks? A Review of Current Issues in AI Evaluation (arxiv.org)
Quantitative artificial intelligence (AI) benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems.
How do we evaluate vector-based code retrieval? (voyageai.com)
Despite the widespread use of vector-based code retrieval, evaluating the retrieval quality of embedding models for code is a common pain point.
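In practice, retrieval quality is usually reported with standard ranking metrics such as recall@k and mean reciprocal rank; the sketch below is a generic illustration of those metrics, not Voyage AI's evaluation harness:

```python
# Generic retrieval metrics over a labeled query set: each query has a ranked
# list of retrieved document IDs and a set of relevant IDs.
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries whose top-k results contain at least one relevant document."""
    hits = sum(1 for docs, rel in zip(retrieved, relevant) if rel & set(docs[:k]))
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none retrieved)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc in enumerate(docs, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Example: two queries against a small code corpus (IDs are hypothetical).
retrieved = [["utils.py::parse", "io.py::read"], ["io.py::read", "net.py::fetch"]]
relevant = [{"utils.py::parse"}, {"net.py::fetch"}]
print(recall_at_k(retrieved, relevant, k=1))      # 0.5
print(mean_reciprocal_rank(retrieved, relevant))  # (1/1 + 1/2) / 2 = 0.75
```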
You're Not Testing Your AI Well Enough (tryreva.com)
Large Language Models (LLMs) have revolutionised machine learning, offering unprecedented versatility across various tasks. However, this flexibility poses a significant challenge: how do we effectively evaluate LLMs to ensure they’re suitable for specific applications?
Generated Checklists Improve LLM Evaluation and Generation (arxiv.org)
Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability.
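The paper's idea is to decompose each instruction into a checklist of yes/no questions and score responses against it. The sketch below captures only that scoring shape; the hand-written checks stand in for the paper's LLM-generated checklists:

```python
# Illustrative checklist-style grader: a response's score is the fraction of
# checklist items it satisfies. Checks here are hand-written placeholders.
from typing import Callable

Check = Callable[[str], bool]

def checklist_score(response: str, checks: list[Check]) -> float:
    """Fraction of checklist items the response satisfies."""
    if not checks:
        return 0.0
    return sum(check(response) for check in checks) / len(checks)

# Instruction: "Summarize the article in exactly three bullet points, in English."
checks: list[Check] = [
    lambda r: r.count("\n- ") + int(r.startswith("- ")) == 3,  # exactly three bullets
    lambda r: len(r.split()) <= 80,                            # stays summary-length
    lambda r: r.isascii(),                                     # crude English/ASCII proxy
]
response = "- Point one\n- Point two\n- Point three"
print(checklist_score(response, checks))  # 1.0
```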
We're approaching LLM prompt evaluation at QA.tech (qa.tech)
The development of autonomous agents poses a unique challenge that other types of applications don’t typically grapple with: heavy reliance on inherently non-deterministic dependencies at multiple points within the system.
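One common response to that non-determinism is to run each prompt or test case several times and report a pass rate rather than a single pass/fail verdict; a minimal sketch, with run_agent and the threshold as hypothetical placeholders rather than QA.tech's implementation:

```python
# Sketch: repeat a non-deterministic agent run and aggregate into a pass rate.
from collections.abc import Callable

def pass_rate(run_agent: Callable[[str], bool], prompt: str, trials: int = 10) -> float:
    """Fraction of repeated runs in which the agent's output passes its assertion."""
    passes = sum(run_agent(prompt) for _ in range(trials))
    return passes / trials

# A prompt might be considered stable enough to ship if, say, >= 90% of runs pass:
# rate = pass_rate(run_agent, "Log in and add an item to the cart")
# assert rate >= 0.9, f"flaky prompt: only {rate:.0%} of runs passed"
```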
Show HN: Anton the Search Relevance Evaluator (objective.inc)
IRL 25: Evaluating Language Models on Life's Curveballs (alignedhq.ai)
Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless (themarkup.org)
eBook on building LLM system evals (forestfriends.tech)
Usefulness Grounds Truth (invertedpassion.com)
How to think about creating a dataset for LLM fine-tuning evaluation (mlops.systems)
Show HN: Paramount – Human Evals of AI Customer Support (github.com/ask-fini)
Evaluation of Machine Learning Primitives on a Digital Signal Processor (diva-portal.org)
Measure Schools on Student Growth (brettcvz.com)
Lessons from the trenches on reproducible evaluation of language models (arxiv.org)
I just spent the past 5 hours comparing LLMs (ycombinator.com)
LLM Leaderboard with explanations of what each score means (crfm.stanford.edu)
An unbiased evaluation of Python environment and packaging tools (2023) (alpopkes.com)