TIGER-Lab Introduces MMLU-Pro: An Upgraded Version of the MMLU Dataset
15/05/2024 13:00:01
We at LLM Explorer love following developments in the LLM scene, both in model advancements and LLM benchmarks. And today we're happy to share some great news from TIGER-Lab—they've introduced an upgraded version of the MMLU dataset, called MMLU-Pro.
The dataset is here.
MMLU-Pro is a more robust and challenging multi-task understanding dataset designed to rigorously benchmark large language models. It contains 12K complex questions across various disciplines.
Here are the key differences compared to the original MMLU:
- Increased Options: MMLU-Pro offers 10 answer options per question, compared to 4 in the original. This makes the evaluation more realistic and challenging, and significantly lowers the score achievable by random guessing.
- Higher Difficulty: The new dataset includes more reasoning-focused problems, increasing overall difficulty. Consequently, Chain-of-Thought (CoT) reasoning can outperform Perplexity (PPL) scoring by up to 20%.
- Performance Stability: Thanks to the increased options, model performance on MMLU-Pro is more stable. For example, Llama-2-7B shows less than 1% performance variance across different prompts, compared to 4-5% on the original MMLU.
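The first point is easy to verify with a bit of arithmetic: with a single correct answer per question, a uniform random guesser's expected accuracy is 1/k for k options, so going from 4 to 10 options lowers the floor from 25% to 10%. A minimal sketch (the function name is ours, purely illustrative, not part of any MMLU-Pro code):

```python
def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing, assuming
    exactly one correct answer per question."""
    return 1.0 / num_options

# Original MMLU: 4 options -> 25% floor
mmlu_floor = random_guess_accuracy(4)
# MMLU-Pro: 10 options -> 10% floor
mmlu_pro_floor = random_guess_accuracy(10)

print(f"MMLU random-guess floor:     {mmlu_floor:.0%}")      # 25%
print(f"MMLU-Pro random-guess floor: {mmlu_pro_floor:.0%}")  # 10%
```

A lower guessing floor widens the usable score range, which is part of why small prompt-induced fluctuations matter less on MMLU-Pro.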
Additionally, the team found that GPT-4o (71%) improves over GPT-4-turbo (62%) by 9 percentage points on MMLU-Pro, whereas the gap on the original MMLU is only around 2 points.