Hacker News with Generative AI: AI Evaluation

Nvidia Outperforms GPT-4o with Open Source Model (github.com/lmarena)
Arena-Hard-Auto-v0.1 (see paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto shows the highest correlation and separability with Chatbot Arena among popular open-ended LLM benchmarks (see paper). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
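For intuition, here is a minimal sketch of this judge-vs-baseline setup: for each query, a judge model compares a candidate response against the baseline's response, and the candidate's win rate is tallied. The prompt wording, model names, and verdict parsing below are illustrative assumptions, not the repository's actual code.

```python
# Minimal sketch of a judge-vs-baseline comparison loop (illustrative only;
# prompt wording, model names, and parsing are assumptions, not the repo's code).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user query and reply with exactly one verdict: [[A]], [[B]], or [[TIE]].

Query: {query}

Response A (baseline): {baseline}

Response B (candidate): {candidate}
"""

def judge_pair(query: str, baseline: str, candidate: str) -> str:
    """Ask the judge model which response is better; returns 'A', 'B', or 'TIE'."""
    completion = client.chat.completions.create(
        model="gpt-4-turbo",  # judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, baseline=baseline, candidate=candidate)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content
    for label in ("A", "B", "TIE"):
        if f"[[{label}]]" in verdict:
            return label
    return "TIE"  # fall back to a tie if the verdict cannot be parsed

def win_rate(records: list[dict]) -> float:
    """Fraction of queries on which the candidate beats the baseline."""
    wins = sum(judge_pair(r["query"], r["baseline"], r["candidate"]) == "B"
               for r in records)
    return wins / len(records)
```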
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Models (arxiv.org)
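The title points at replacing a single large judge with a panel of smaller models whose verdicts are aggregated. Below is a hypothetical sketch of that aggregation step, assuming a simple majority vote and placeholder panel models (not the paper's actual panel, prompts, or aggregation rule).

```python
# Hypothetical "jury" aggregation: several judge models vote A/B/TIE and the
# majority vote wins. Panel composition and prompt are placeholder assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()
JURY_MODELS = ["gpt-3.5-turbo", "gpt-4o-mini"]  # placeholder panel members

def jury_verdict(query: str, answer_a: str, answer_b: str) -> str:
    """Each panel member votes A, B, or TIE; the majority vote is returned."""
    prompt = (f"Which response answers the query better? Reply A, B, or TIE.\n"
              f"Query: {query}\nResponse A: {answer_a}\nResponse B: {answer_b}")
    votes = []
    for model in JURY_MODELS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content.strip().upper()
        votes.append("A" if reply.startswith("A") else
                     "B" if reply.startswith("B") else "TIE")
    return Counter(votes).most_common(1)[0][0]
```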