From GPT-4 to Llama 3: LMSYS Chatbot Arena Ranks Top LLMs

Himanshi Singh Last Updated : 21 Nov, 2024

Introduction

Every week, new and more advanced Large Language Models (LLMs) are released, each claiming to be better than the last. But how can we keep up with all these new developments? The answer is the LMSYS Chatbot Arena.

The LMSYS Chatbot Arena is an innovative platform created by the Large Model Systems Organization, a group made up of students and teachers from UC Berkeley, UCSD, and CMU. This platform makes it easy to compare and evaluate different LLMs by allowing users to test and rate them. It’s a place where anyone interested in these models can come to find out about the latest releases and see how they stack up against each other.

LMSYS Leaderboard
Top 10 LLMs
Difference between Open Source vs Closed Source LLMs
- Open Source LLMs
- Closed Source LLMs
How Does LMSYS Arena Work?
LMSYS Leaderboard Evaluation System

LMSYS Leaderboard

This leaderboard ranks LLMs by fitting a Bradley-Terry model to human pairwise comparisons and displaying the resulting scores on an Elo scale. As of April 26, 2024, it includes 91 models and has collected more than 800,000 human pairwise comparisons. Models are also ranked within categories, such as coding and long user queries, and the leaderboard is continuously updated.

Top 10 LLMs

The top and trending models based on Arena Elo Ratings are:

GPT-4-Turbo by OpenAI
GPT-4-1106-preview by OpenAI
Claude 3 Opus by Anthropic
Gemini 1.5 Pro API-0409-Preview by Google
GPT-4-0125-preview by OpenAI
Bard (Gemini Pro) by Google
Llama 3 70b Instruct by Meta
Claude 3 Sonnet by Anthropic
Command R+ by Cohere
GPT-4-0314 by OpenAI

OpenAI holds four of the top ten spots, including the first two, more than any other provider on this list.

If you are wondering why some model names include the word “preview”, here is the answer. A “preview” is a version of a large language model (LLM) that is made available for testing, feedback, or experimental use before its official release. Developers and users can explore the model’s capabilities, identify issues, and give feedback that feeds into later refinements. It is essentially a beta release: mostly functional and showcasing new features or improvements, but possibly still carrying bugs or limitations that need addressing before a full, stable release.

A model’s rank takes the 95% confidence interval of its rating into account, and models with fewer than 500 votes are removed from the rankings.
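
The leaderboard page does not spell out here how that interval is computed. As a rough sketch of the idea, the snippet below bootstraps a 95% interval for a single model and drops models below the 500-vote cutoff. The helper names (`win_rate`, `bootstrap_ci`, `rankable`) are hypothetical, and a plain win rate stands in for the leaderboard's actual rating; both are assumptions made only for illustration.

```python
import random

MIN_VOTES = 500  # cutoff quoted above


def win_rate(votes):
    """Fraction of wins in a list of 1 (win) / 0 (loss) outcomes."""
    return sum(votes) / len(votes)


def bootstrap_ci(votes, rounds=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the win rate (illustrative stand-in
    for a rating; not the leaderboard's own procedure)."""
    rng = random.Random(seed)
    n = len(votes)
    # Resample the votes with replacement and recompute the statistic each time.
    stats = sorted(win_rate(rng.choices(votes, k=n)) for _ in range(rounds))
    lo_idx = round((1 - level) / 2 * rounds)
    hi_idx = round((1 + level) / 2 * rounds) - 1
    return stats[lo_idx], stats[hi_idx]


def rankable(votes_by_model, min_votes=MIN_VOTES):
    """Keep only models that have at least `min_votes` votes."""
    return {m: v for m, v in votes_by_model.items() if len(v) >= min_votes}


# Made-up data: 300 wins and 200 losses for one hypothetical model.
votes = [1] * 300 + [0] * 200
low, high = bootstrap_ci(votes)
print(f"win rate {win_rate(votes):.2f}, 95% interval [{low:.3f}, {high:.3f}]")
```

A wide interval means the model's position is still uncertain, which is why two models with overlapping intervals should not be read as strictly ordered.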

Difference between Open Source vs Closed Source LLMs

You might have heard that Llama 3 is the best open source Large Language Model (LLM) so far. However, if you check the overall rankings, GPT-4 Turbo is at the top. Why is that? It’s because the rankings include both open source and closed source LLMs.

Look at the last column of the leaderboard—it shows the type of license each LLM has. This is important because it divides the models into two main groups: open source and closed source.

Open Source LLMs

Open source LLMs make their code, and usually their model weights, publicly available. Anyone can inspect, understand, and even improve the model, which fosters collaborative development.

Freely Available: These models have permissive licenses like Apache 2.0 or MIT, which place very few restrictions on use (e.g., Mixtral-8x22b-Instruct, Zephyr-ORPO, Starling-LM-7B-beta, OpenChat-3.5, Zephyr-7b-beta).
Limited Use: Some open source models have restrictions attached to their licenses. These can limit commercial use (e.g., non-commercial Creative Commons licenses) or attach conditions to how modified versions are redistributed (e.g., copyleft licenses). Examples include Command R+ and Llama 3.

Closed Source LLMs

Closed source LLMs are not publicly available and require permission or a license to use. They are typically developed by commercial entities (e.g., OpenAI’s GPT-4 series, Google’s Gemini series, Anthropic’s Claude series).

In short, open source LLMs offer transparency and foster collaboration, while closed-source LLMs prioritize control and potentially deliver a more polished user experience.

How Does LMSYS Arena Work?

The LMSYS platform works by collecting user dialogue data to evaluate large language models (LLMs). Users can compare two different LLMs side-by-side on a given task and then vote on which LLM provided a better response. The LMSYS platform uses these votes to rank the different LLMs.

Here’s a step-by-step breakdown of how LMSYS works:

Go to LMSYS platform > ⚔️ Arena (side-by-side) and select any two different LLMs that you want to compare.

Then provide a task or prompt for the two LLMs to complete. This task can be anything that can be evaluated by a human, such as writing a poem, translating a language, or answering a question. Here I asked the models: Write a 700 words article on Top Open Source LLMs.

You’ll see two answers from different LLMs side by side. Pick the one you prefer. If both are equally good, pick “Tie”; if you don’t like either, pick “Both are bad”.

The LMSYS platform then uses your vote to update the rankings of the two LLMs. The rankings are computed with the Bradley-Terry model, a statistical model for ranking items based on pairwise comparisons.
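
To make that last step concrete, here is a minimal sketch of fitting Bradley-Terry strengths to a list of votes. It is not the Arena's code: the function names, the iteration count, the base rating of 1000, and the decision to ignore ties are all assumptions. It also expects every model to have at least one win and one loss, and each vote to involve two different models.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Repeats the classic update p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    where n_ij is the number of battles between models i and j.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)  # frozenset({a, b}) -> number of battles
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}  # the opponent in this pairing
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Only ratios matter, so rescale to keep the numbers well behaved.
        total = sum(updated.values())
        strength = {m: p * len(models) / total for m, p in updated.items()}
    return strength


def to_elo_scale(strength, base=1000.0):
    """Map strengths to an Elo-style scale (400 points per factor of 10)."""
    return {m: base + 400.0 * math.log10(p) for m, p in strength.items()}


# Made-up votes: model "A" beats model "B" three times out of four.
battles = [("A", "B")] * 3 + [("B", "A")]
ratings = to_elo_scale(fit_bradley_terry(battles))
print(ratings)  # "A" ends up roughly 191 points above "B"
```

Because the fit uses all votes at once, shuffling the order of `battles` gives the same ratings.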

LMSYS Leaderboard Evaluation System

The LMSYS leaderboard uses two main ways to rate Large Language Models (LLMs): the Elo rating system and the Bradley-Terry model.

Elo Rating System: This system, which is also used in chess, gives each LLM a score based on its performance. An LLM gains points when it wins a match and loses points when it loses. The difference in points between two LLMs indicates which one is stronger and how likely it is to win future matches.
Bradley-Terry Model: Instead of updating scores one match at a time, this model estimates a single strength for each LLM from all recorded comparisons at once. Because the result does not depend on the order in which the votes arrived, the ratings are more stable, and they can be rescaled onto the familiar Elo scale for display.
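
The two views are closely related, which is why Bradley-Terry results can be shown on an Elo scale. The notation below is ours rather than the leaderboard's: $p_i$ is the Bradley-Terry strength of model $i$, $R_i$ is its Elo-style rating, and $C$ is an arbitrary offset.

```latex
% Bradley-Terry: probability that model i beats model j
P(i \text{ beats } j) = \frac{p_i}{p_i + p_j}

% Put strengths on a log scale: R_i = 400 \log_{10} p_i + C
% Then p_j / p_i = 10^{(R_j - R_i)/400}, so
P(i \text{ beats } j) = \frac{1}{1 + p_j / p_i}
                      = \frac{1}{1 + 10^{(R_j - R_i)/400}}
```

The last expression is the standard Elo expected-score formula, so a 400-point gap corresponds to 10-to-1 odds under either description.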

In the LMSYS Chatbot Arena, LLMs are like players in a game, competing against each other for user votes. Each LLM starts with a base score, which changes as it wins or loses matches. Beating a stronger LLM earns more points than beating a weaker one, and losing to a weaker LLM costs more points than losing to a stronger one. This keeps the ratings in line with the models’ current relative strengths.
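
Here is a small sketch of that update rule using the standard Elo formula. The function names and the step size `k=32` are assumptions chosen for illustration; the article does not state which value the Arena uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b) after one match.

    score_a is 1 if A won, 0 if B won, and 0.5 for a tie.
    k is an assumed step size, not a value taken from the Arena.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Evenly matched models: the winner gains exactly what the loser gives up.
print(update_elo(1000, 1000, 1))  # (1016.0, 984.0)

# An upset pays more: a 1000-rated model beating a 1200-rated one gains
# about 24.3 points, while the favourite winning would gain only about 7.7.
underdog_gain = update_elo(1000, 1200, 1)[0] - 1000
favourite_gain = update_elo(1200, 1000, 1)[0] - 1200
print(round(underdog_gain, 1), round(favourite_gain, 1))
```

The total number of points is conserved in every match, so ratings only move relative to one another.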

The Elo system is great for keeping track of how LLMs perform over time, helping to understand which models are doing well and predicting how they might do in the future. This makes it a very useful tool for seeing how new and existing models stack up against each other in the ever-changing world of AI development.

To read more about the evaluation process, check out their paper: https://arxiv.org/abs/2403.04132

Conclusion

I hope this article has helped you understand how the LMSYS leaderboard works and where you can keep track of the latest developments in large language models.

The LMSYS Chatbot Arena lets users rank the models through their votes and scores them with statistical rating methods (Elo and Bradley-Terry). This makes it a great place to see how these models actually perform. Understanding these models better helps everyone use them more effectively in real-life situations.

If you know of any other resources that can help stay up-to-date in the field of Generative AI, please share them in the comments section below. Your input can help us all keep pace with this rapidly evolving technology!

