LMArena has some competition: Scale AI launches Seal Showdown, a new benchmarking tool

Is this the next big AI benchmarking tool in town?

In the years since OpenAI launched ChatGPT to the world, kicking off the generative AI boom, developers have relied on LMArena (previously Chatbot Arena) as the default AI leaderboard. Now, Scale AI is bringing some much-needed competition to the AI benchmarking space with its new Seal Showdown benchmarking tool.

Like LMArena, Seal Showdown allows users to test various AI models head-to-head and vote on which one performs better. However, Scale AI says that unlike LMArena, Seal Showdown will more closely reflect how everyday users feel about various models. In an X post, Scale CEO Jason Droege said that Seal Showdown "actually captures real preferences, powered by a platform used by real people."

"Most benchmarks rely on synthetic tests (coding puzzles, math problems) or feedback from a small slice of people," said Scale AI’s head of product, Janie Gu, in a blog post. "They miss the full spectrum of how real people actually use models in their daily lives. By treating diverse users as a monolith and lumping all feedback into one generalized score, critical nuance is lost."

Scale AI launched its Safety, Evaluations, and Alignment Lab (SEAL) leaderboards last year, but those leaderboards relied on expert evaluations. Now, Scale AI will offer leaderboards based on user testing, providing an alternative to LMArena.

The startup says its new benchmarking tool is based on real-world use and feedback from "users spanning over 100 countries, 70 languages, and 200 professional domains." (The company also provided the precise methodology for Seal Showdown.)

"Showdown introduces something never before seen in public leaderboards: rich user segmentation," Gu wrote in the blog post announcing the project. "Because rankings are derived from conversations that contributors have on Scale’s Outlier platform, Scale is able to verify each user’s country, education level, profession, language, and age — enabling anyone to see how models perform for people like them."

Because of this demographic information, Scale AI will be able to show which models are most popular according to specific regions, languages, ages, or use cases.

Scale AI's criticism of existing leaderboards is that they "rely heavily on hobbyist participation" and that current rankings are "based on a narrow group of users and their interests," which, the company argues, misrepresents how those LLMs perform in general use.

LMArena has also been criticized for bias against open models. Critics say that LMArena's system favors frontier models from big AI companies like Google, xAI, and OpenAI. However, Scale AI's solution may not be ideal, either. The initial leaderboard results overwhelmingly rank GPT-5 the highest, which may merely reflect user preference rather than objective performance.

The updated SEAL leaderboards are live now. Currently, GPT-5 tops all of the benchmark categories, a stark contrast to LMArena, where Google's Gemini 2.5 Pro, 2.5 Flash, and Veo 3 lead most of the leaderboard categories.


Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.
