MCP-Universe benchmark shows GPT-5 fails more than half of real-world orchestration tasks
Summary
Salesforce AI Research has developed MCP-Universe, an open-source benchmark that assesses how well language models (LMs) perform when interacting with real-world Model Context Protocol (MCP) servers.
MCP-Universe measures a model's capabilities across tasks such as location navigation, financial analysis, and browser automation, scoring each task by whether the model's actions produce the correct real-world outcome rather than by the text it generates.
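A minimal sketch of that execution-based scoring pattern is shown below; every name here is illustrative, not the benchmark's actual API, and the toy task stands in for MCP-Universe's real server-backed checks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str                      # instruction handed to the model/agent
    evaluate: Callable[[str], bool]  # execution-based check on the outcome

def run_benchmark(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Run each task through the agent and score concrete results,
    not textual similarity to a gold answer."""
    passed = sum(task.evaluate(agent(task.prompt)) for task in tasks)
    return passed / len(tasks)

# Hypothetical financial-analysis task: the agent passes only if its
# final answer contains the correctly computed figure.
tasks = [
    Task(
        prompt="What is 15% annual growth on $200 after one year?",
        evaluate=lambda out: "230" in out,
    ),
]

if __name__ == "__main__":
    stub_agent = lambda prompt: "The result is $230.00"
    print(f"Success rate: {run_benchmark(stub_agent, tasks):.0%}")
```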
In testing, GPT-5 achieved the highest success rate, yet all models struggled with long contexts, particularly in location navigation, browser automation, and financial analysis, where their performance dropped significantly.
These findings show that current LMs still fall short on diverse real-world tasks, and MCP-Universe provides a much-needed testbed for evaluating them.