Summary

  • Inclusion AI, an AI lab affiliated with Alibaba’s Ant Group, has proposed a new benchmarking system for large language models (LLMs) called Inclusion Arena.
  • The aim is to test how AI models perform in real-life situations, as opposed to on static datasets, with rankings driven by user preferences.
  • It currently operates within two apps, Joyland and T-Box, where users select their preferred answer from a set of responses without knowing which model generated each one.
  • The method ranks models using the Bradley-Terry model, the same technique employed by Chatbot Arena, with the data capped at July 2025 (see the sketch after this list).
  • Anthropic’s Claude 3.7 Sonnet was the top-performing model in the initial experiments.
  • The paper states that with more data and users, the leaderboard would become more robust and precise.
  • Enterprises need better-informed model selection, and this system aims to provide it.
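
As context for the ranking method named above: the Bradley-Terry model estimates a latent "strength" for each model from pairwise preference outcomes, and models are ranked by that strength. The sketch below is a generic, illustrative fit using the standard minorization-maximization (MM) update; the model names and win counts are invented, and this is not Inclusion Arena's actual implementation.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 1000, tol: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths from a matrix where wins[i, j] is the
    number of times model i was preferred over model j, using the
    classic minorization-maximization (Zermelo) update."""
    n = wins.shape[0]
    p = np.ones(n)           # initial strengths
    total = wins + wins.T    # total comparisons between each pair
    for _ in range(iters):
        p_new = np.empty(n)
        for i in range(n):
            mask = np.arange(n) != i
            # MM update: p_i = (total wins of i) / sum_j n_ij / (p_i + p_j)
            denom = np.sum(total[i, mask] / (p[i] + p[mask]))
            p_new[i] = wins[i, mask].sum() / denom
        p_new /= p_new.sum()  # normalize (strengths are scale-invariant)
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return p

# Hypothetical pairwise win counts among three models (rows beat columns).
models = ["model_a", "model_b", "model_c"]
wins = np.array([
    [0.0, 30.0, 45.0],
    [20.0, 0.0, 35.0],
    [15.0, 25.0, 0.0],
])

strengths = fit_bradley_terry(wins)
for name, s in sorted(zip(models, strengths), key=lambda t: -t[1]):
    print(f"{name}: {s:.3f}")
```

In a live arena such as the one described, each user choice between two anonymous responses contributes one pairwise outcome, and the leaderboard is the models sorted by fitted strength.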

By Emilia David

Original Article