Nvidia Outperforms GPT-4o with Open Source Model
(github.com/lmarena)
Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto has the highest correlation with Chatbot Arena and the best separability among popular open-ended LLM benchmarks (See Paper). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
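For context, here is a minimal sketch of the pairwise LLM-as-judge protocol the description refers to: GPT-4-Turbo compares each candidate answer against the baseline's answer for the same query, and a win rate is aggregated over the query set. This is an illustrative sketch, not the repo's actual code; the prompt wording, verdict format, and tie handling are assumptions (the real tool uses more elaborate prompts and judges each pair in both orders). It assumes the openai Python package and an OPENAI_API_KEY in the environment.

```python
# Illustrative sketch of pairwise LLM-as-judge evaluation; NOT the
# Arena-Hard-Auto implementation. Prompt text and scoring are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two assistant responses to the same user query.
Query:
{query}

Response A (baseline):
{baseline}

Response B (candidate):
{candidate}

Which response is better? Answer with exactly one of: A, B, or TIE."""


def judge_pair(query: str, baseline: str, candidate: str) -> str:
    """Ask the judge model (GPT-4-Turbo, as in the post) for a verdict."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, baseline=baseline, candidate=candidate)}],
    )
    return resp.choices[0].message.content.strip()


def win_rate(examples: list[dict]) -> float:
    """Fraction of queries where the candidate beats the baseline.

    Ties count as half a win (a common, but here assumed, convention).
    Each example dict holds "query", "baseline", and "candidate" strings.
    """
    score = 0.0
    for ex in examples:
        verdict = judge_pair(ex["query"], ex["baseline"], ex["candidate"])
        if verdict.startswith("B"):
            score += 1.0
        elif verdict.upper().startswith("TIE"):
            score += 0.5
    return score / len(examples)
```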