Stop benchmarking in the lab: Inclusion Arena shows how LLMs perform in production
Summary
Inclusion AI, affiliated with Alibaba’s Ant Group, has proposed a new benchmarking system for large language models (LLMs) that it calls Inclusion Arena.
The aim is to evaluate model performance in real-life usage rather than on static datasets, with rankings driven by user preferences.
It is currently integrated into two apps, Joyland and T-Box, where users select their preferred answer from multiple candidate responses without knowing which model generated each one.
The method fits a Bradley-Terry model to these pairwise preferences to rank the models, similar to the approach used by Chatbot Arena, with data for the initial leaderboard collected up to July 2025.
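For illustration, here is a minimal sketch of how a Bradley-Terry ranking can be derived from pairwise preference counts. The model names, win matrix, and the simple minorization-maximization fitting loop below are illustrative assumptions, not Inclusion Arena's actual data or implementation.

```python
# Minimal sketch: Bradley-Terry ranking from pairwise preference counts.
# The win matrix and model names are made-up illustrative data, and the
# fitting routine is a standard MM update, not the paper's exact method.

import numpy as np

models = ["model_a", "model_b", "model_c"]  # hypothetical model names

# wins[i, j] = number of times users preferred models[i] over models[j]
wins = np.array([
    [0, 30, 45],
    [20, 0, 35],
    [15, 25, 0],
], dtype=float)

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths via minorization-maximization updates."""
    n = wins.shape[0]
    p = np.ones(n)                  # initial strength for every model
    total = wins + wins.T           # total comparisons for each pair
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()     # total wins for model i
            denom = sum(total[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = num / denom
        p /= p.sum()                # normalize so strengths are identifiable
    return p

strengths = bradley_terry(wins)
for name, s in sorted(zip(models, strengths), key=lambda x: -x[1]):
    print(f"{name}: {s:.3f}")
```

Higher estimated strength means a model was preferred more often across its pairwise matchups, which is the quantity a leaderboard of this kind would sort by.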
Anthropic’s Claude 3.7 Sonnet was the top-performing model in the initial experiments.
The paper states that with more data and more users, the leaderboard would become more robust and precise.
Enterprises need better information when selecting models, and this system aims to provide it.