LLM Leaderboards

All worthwhile Large Language Model (LLM) Leaderboards on a single page.

Large language model (LLM) leaderboards are ranking systems used in natural language processing (NLP) to evaluate and compare the performance of language models. They encourage competition, guide model development, and set a common standard for measuring effectiveness across tasks such as text generation, translation, sentiment analysis, and question answering. Despite their role in promoting innovation and highlighting leading models, LLM leaderboards face scrutiny over their actual impact on NLP progress. Recent studies have pointed out significant issues, including biases in human judgment, data contamination risks, and inadequate evaluation methods, particularly those based on multiple-choice tests. These challenges have spurred demands for task-specific benchmarks that evaluate models more precisely for particular use cases. There are also growing concerns about manipulation of leaderboard results, which could hinder real progress, diminish trust within the NLP community, and slow innovation. Strict measures to ensure the integrity and fairness of LLM leaderboards are therefore essential for fostering genuine progress and maintaining healthy competition in NLP development.

  • LMSYS Chatbot Arena Leaderboard

    LMSYS Chatbot Arena is an open, crowdsourced platform for evaluating Large Language Models (LLMs). Using over 200,000 human preference votes, it ranks LLMs with the Elo rating system and integrates benchmarks such as MT-Bench and MMLU for comprehensive analysis.

  • Trustbit LLM Benchmark

    The monthly LLM Leaderboards help you find the best Large Language Model for digital product development. Based on real benchmark data from our own software products, each month we re-evaluate how well different LLMs handle specific challenges. We examine categories such as document processing, CRM integration, external integration, marketing support, and code generation. Rely on us to take your projects to the next level!

  • Oobabooga benchmark

    A new test has been developed to assess the academic knowledge and logical reasoning abilities of AI models. This benchmark consists of 48 multiple-choice questions that were written manually to ensure originality and prevent any overlap with existing training datasets. Because the questions have not appeared in any prior training data, the test gives a more accurate picture of a model's true capabilities. The trade-off is size: the benchmark is much smaller than well-known benchmarks such as MMLU (Massive Multitask Language Understanding). The test may also be more challenging for smaller models, such as Starling-LM-7B-beta, which can generate well-structured responses but may lack the extensive knowledge required to do well here. In contrast, the LMSYS Chatbot Arena, another popular benchmark, may be more forgiving to these smaller models.

  • OpenCompass: CompassRank

    CompassRank is dedicated to exploring the most advanced language and visual models, offering a comprehensive, objective, and neutral evaluation reference for the industry and research community.

  • EQ-Bench: Emotional Intelligence

    EQ-Bench is a benchmark designed to evaluate the emotional intelligence of Large Language Models (LLMs). It tests how well LLMs grasp complex emotional dynamics and social interactions by asking them to rate the intensity of characters' emotional states in dialogues. Designed to distinguish effectively among models, EQ-Bench correlates significantly with broad, multi-domain benchmarks such as MMLU. It includes 60 English-language questions designed for consistent repeatability. Open-source code for an automated benchmarking pipeline and a leaderboard for tracking results are also provided. EQ-Bench is recognized as a crucial resource for CIOs and CTOs involved in AI development.

  • HuggingFace Open LLM Leaderboard

    The Open LLM Leaderboard by Hugging Face is a hub for benchmarking large language models (LLMs), with detailed results and queries available for each model listed. It runs evaluations using the EleutherAI LM Evaluation Harness and stores the results in a dataset that is displayed online. Recently, there have been discussions about discrepancies in evaluation numbers, particularly for the MMLU benchmark. Updates to the leaderboard involve adding new benchmark metrics and re-running models to give model creators more comprehensive information.

  • Berkeley Function-Calling Leaderboard

    The Berkeley Function-Calling Leaderboard is a live evaluation platform that assesses the ability of different Large Language Models (LLMs) to accurately call functions or tools. This leaderboard includes real-world data and is periodically updated. It focuses on evaluating the function calling capability of LLMs across various scenarios, languages (such as Python, Java, JavaScript, REST API, SQL), and application domains. The leaderboard ranks models based on their performance in simple function calls, multiple function calls, parallel function calls, parallel multiple function calls, and function relevance detection. Notably, models like GPT-4 from OpenAI and Gorilla OpenFunctions-v2 from Gorilla LLM have shown strong performance in this evaluation.

  • The CanAiCode Leaderboard

    The CanAiCode leaderboard is part of the CanAiCode test suite, which is designed for testing small text-to-code Large Language Models (LLMs). It is a self-evaluating interview for AI coders that assesses how well these models generate code from text inputs. The associated CanAiCode project focuses on evaluating and benchmarking the ability of AI models to understand textual descriptions and generate code snippets from them.

  • Open Multilingual LLM Evaluation Leaderboard

    This leaderboard showcases the advancements and ranks the performance of large language models (LLMs) across a diverse range of languages, with a special focus on non-English languages to spread the advantages of LLMs to a wider audience. Currently featuring evaluation data for 29 languages including Arabic, Chinese, French, German, Hindi, Spanish, and more, the leaderboard is committed to expanding its linguistic coverage. It welcomes both multilingual and language-specific LLMs for evaluation across four benchmarks: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA, with translations provided by ChatGPT (gpt-35-turbo) to facilitate comprehensive language inclusivity.

  • Massive Text Embedding Benchmark (MTEB) Leaderboard

    The Massive Text Embedding Benchmark (MTEB) Leaderboard is a platform where models are benchmarked on 8 embedding tasks covering 58 datasets and 112 languages. It evaluates text embedding models across tasks such as bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. The benchmarking process has shown that no single text embedding method consistently outperforms others across all tasks, indicating that there is no universal text embedding method. MTEB is designed to be massive, multilingual, and extensible, providing an extensive evaluation tool for text embeddings.

  • AlpacaEval Leaderboard

    The AlpacaEval Leaderboard is a platform for evaluating language models based on an automatic evaluation system that is fast, cost-effective, and reliable. It utilizes the AlpacaFarm evaluation set to assess the performance of models in tasks like instruction-following and language understanding.

  • PT-LLM: Open Portuguese LLM Leaderboard

    The Open PT LLM Leaderboard is dedicated to establishing a standard for assessing Large Language Models (LLMs) in Portuguese, covering diverse tasks and datasets. It welcomes model submissions from the community, serving as a valuable tool for researchers, practitioners, and enthusiasts focused on advancing and evaluating Portuguese LLMs.

  • Ko-LLM: Open Korean LLM Leaderboard

    The Open Ko-LLM Leaderboard 🇰🇷 provides an impartial assessment of Korean Large Language Model (LLM) performance.

  • Uncensored General Intelligence Leaderboard

    A measurement of the amount of uncensored/controversial information an LLM knows. It is calculated from the average score of 5 subjects LLMs commonly refuse to talk about. The leaderboard is made of roughly 60 questions/tasks, measuring both "willingness to answer" and "accuracy" in controversial fact-based questions. I'm choosing to keep the questions private so people can't train on them and devalue the leaderboard.
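
As a rough illustration of the Elo ranking mentioned in the LMSYS Chatbot Arena entry above, the sketch below applies the textbook Elo update to a list of pairwise preference votes. It is not the Arena's actual implementation: the K-factor of 32, the starting rating of 1000, the function names, and the model names are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical vote format: (model_a, model_b, winner), where winner is
# model_a, model_b, or the string "tie".
Vote = Tuple[str, str, str]


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from any leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # B's change is the exact opposite of A's change.
    return rating_a + delta, rating_b - delta


def rank_from_votes(votes: Iterable[Vote], initial: float = 1000.0,
                    k: float = 32.0) -> List[Tuple[str, float]]:
    """Process votes in order and return (model, rating) sorted best first."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, winner in votes:
        ra = ratings.get(model_a, initial)
        rb = ratings.get(model_b, initial)
        if winner == model_a:
            score_a = 1.0
        elif winner == model_b:
            score_a = 0.0
        else:
            score_a = 0.5  # treat anything else as a tie
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up votes between made-up models, for illustration only.
    demo_votes = [
        ("model-x", "model-y", "model-x"),
        ("model-y", "model-z", "model-y"),
        ("model-x", "model-z", "tie"),
    ]
    for name, rating in rank_from_votes(demo_votes):
        print(f"{name}: {rating:.1f}")
```

Each update moves the two ratings by equal and opposite amounts, so the total rating is conserved. Because votes are processed sequentially, the final ratings in this sketch depend on the order of the votes.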

Original data from HuggingFace, OpenCompass and various public git repos.
Release v20241227