TIGER-Lab Introduces MMLU-Pro: An Upgraded Version of the MMLU Dataset

MMLU-Pro

We at LLM Explorer love following developments in the LLM scene, both in model advancements and in LLM benchmarks. Today we're happy to share some news from TIGER-Lab: they've introduced an upgraded version of the MMLU dataset, called MMLU-Pro.

The dataset is available on HuggingFace as TIGER-Lab/MMLU-Pro.

MMLU-Pro is a more robust and challenging multi-task language understanding dataset designed to rigorously benchmark large language models. It contains roughly 12,000 complex questions across various disciplines.
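Each MMLU-Pro record pairs a question with up to ten lettered answer options. As a rough sketch of how one record could be rendered into an evaluation prompt (the field names `question`, `options`, and `answer` follow the dataset card, but verify them against the actual schema before relying on this):

```python
# Sketch: format one MMLU-Pro-style record into a 10-option prompt.
# Field names below are assumptions taken from the dataset card.
import string

def format_question(record: dict) -> str:
    """Render a question and its lettered options (A-J) as a prompt."""
    lines = [record["question"]]
    for letter, option in zip(string.ascii_uppercase, record["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

# A hypothetical record with 10 options, as in MMLU-Pro.
example = {
    "question": "Which planet is largest?",
    "options": ["Mercury", "Venus", "Earth", "Mars", "Jupiter",
                "Saturn", "Uranus", "Neptune", "Pluto", "Ceres"],
    "answer": "E",
}
prompt = format_question(example)
```

The model's predicted letter can then be compared against the record's `answer` field for exact-match scoring.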

Here are the key differences compared to the original MMLU:

  1. Increased Options: MMLU-Pro has 10 answer options per question, compared to 4 in the original. This makes the evaluation more realistic and challenging, lowering the expected score from random guessing from 25% to 10%.

  2. Higher Difficulty: The new dataset includes more reasoning-focused problems, increasing overall difficulty. Consequently, Chain-of-Thought (CoT) prompting can outperform perplexity-based (PPL) answer selection by up to 20%.

  3. Performance Stability: Thanks to the larger option set, model scores on MMLU-Pro are more stable across prompts. For example, Llama-2-7B shows less than 1% performance variance under different prompts, compared to 4-5% on the original MMLU.
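The effect of moving from 4 to 10 options on the random-guessing floor is easy to check with a quick simulation (a minimal sketch, not part of any official evaluation harness):

```python
# Sketch: estimate the random-guessing baseline for 4 vs. 10 answer options.
import random

def random_guess_accuracy(num_options: int, trials: int = 100_000,
                          seed: int = 0) -> float:
    """Fraction of trials where a uniform random guess hits the correct option."""
    rng = random.Random(seed)
    # The correct option can sit at index 0 without loss of generality,
    # since the guess is uniform over all options.
    hits = sum(rng.randrange(num_options) == 0 for _ in range(trials))
    return hits / trials

acc_mmlu = random_guess_accuracy(4)       # ~0.25 on 4-option MMLU
acc_mmlu_pro = random_guess_accuracy(10)  # ~0.10 on 10-option MMLU-Pro
```

In other words, a model that merely guesses loses 15 points of headroom, which is part of why MMLU-Pro scores read lower and spread further apart.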

Additionally, the team found that GPT-4o (71%) improves over GPT-4-turbo (62%) by 9 percentage points on MMLU-Pro, whereas the gap on the original MMLU is only around 2 points, suggesting MMLU-Pro is better at separating frontier models.
