
Scale AI has introduced a new benchmarking tool called Seal Showdown, positioning it as a competitor to LMArena, whose leaderboards have served as the de facto standard for comparing AI models since OpenAI launched ChatGPT. The new tool lets users compare AI models head to head and vote for the responses they prefer, with the aim of reflecting real user experiences more accurately.
In a recent post on X, Scale AI CEO Jason Droege highlighted the tool’s ability to capture real user preferences: “Seal Showdown actually captures real preferences, powered by a platform used by real people.” Janie Gu, Scale AI’s head of product, echoed that view, criticizing existing benchmarks for relying on synthetic tests and feedback from a limited audience. According to Gu, these methods overlook the diverse ways in which users interact with AI models daily.
Introducing User-Centric Benchmarks
“They miss the full spectrum of how real people actually use models in their daily lives,” Gu explained in a blog post about Seal Showdown. The new tool builds on Scale AI’s earlier initiative, the Safety, Evaluations, and Alignment Lab (SEAL), which focused on expert evaluations. In contrast, Seal Showdown emphasizes testing by everyday users, drawing feedback from individuals in more than 100 countries, in 70 languages, and across 200 professional domains.
One of the key features of Seal Showdown is its rich user segmentation. Rankings are derived from interactions on Scale’s Outlier platform, which verifies users’ demographics, including country, education level, profession, language, and age. This allows the platform to present how different models perform for specific user groups, thus enhancing the relevance of the rankings.
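Scale has not published the scoring details behind these segmented rankings, but the idea can be illustrated with a minimal sketch: filter verified, demographically tagged head-to-head votes down to a segment and tally per-model win rates. The `Vote` fields, the `segment_ranking` helper, and the model names below are hypothetical, and the simple win-rate tally is a stand-in rather than Seal Showdown’s actual method.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record of a single head-to-head vote. Field names are
# illustrative; Scale has not published Seal Showdown's data schema.
@dataclass
class Vote:
    winner: str      # model the voter preferred
    loser: str       # model the voter rejected
    country: str     # verified demographic attributes
    profession: str
    language: str

def segment_ranking(votes, **filters):
    """Rank models by win rate within a demographic segment.

    `filters` are attribute=value pairs (e.g. profession="lawyer").
    A plain win-rate tally, not Scale's actual scoring method.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for v in votes:
        # Keep only votes from users matching every requested attribute.
        if all(getattr(v, key) == val for key, val in filters.items()):
            wins[v.winner] += 1
            games[v.winner] += 1
            games[v.loser] += 1
    # Sort models by fraction of matchups won within the segment.
    return sorted(
        ((model, wins[model] / games[model]) for model in games),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Example: ranking restricted to Spanish-speaking legal professionals.
votes = [
    Vote("model-a", "model-b", "MX", "lawyer", "es"),
    Vote("model-b", "model-a", "MX", "lawyer", "es"),
    Vote("model-a", "model-c", "ES", "lawyer", "es"),
]
print(segment_ranking(votes, profession="lawyer", language="es"))
```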
A common criticism of existing leaderboards, including LMArena, is that they largely reflect the interests of a niche group of hobbyists. That narrow focus can misrepresent the overall performance of large language models (LLMs). Critics also argue that LMArena tends to favor models from major AI firms such as Google, xAI, and OpenAI, potentially skewing public perception.
Initial Rankings and Market Reaction
As the updated SEAL leaderboards went live, GPT-5 emerged as the top performer in all benchmark categories, contrasting sharply with LMArena, where models such as Google’s Gemini 2.5 Pro, 2.5 Flash, and Veo 3 dominate the rankings. While Seal Showdown aims to offer a more balanced perspective, the initial results have raised questions about whether they truly represent objective performance or merely reflect user preferences.
“Current rankings are based on a narrow group of users and their interests,” Gu stated, emphasizing the need for broader representation in benchmarking AI models.
As Scale AI continues to refine its approach, the company is committed to transparency regarding the methodology behind Seal Showdown, aiming to provide a clearer picture of AI model performance across diverse user experiences. The introduction of this tool marks a significant step in the ongoing evolution of AI benchmarking, with the potential to reshape how users evaluate and select AI technologies in the future.
This development arrives as the AI landscape becomes increasingly competitive, with various companies vying for dominance in the generative AI space. Scale AI’s new tool may not only challenge LMArena’s long-standing position but also encourage more comprehensive user feedback in the evaluation process, fostering a more inclusive understanding of AI capabilities.