Salesforce's CRM benchmark finds AI agents struggle in real-world business scenarios

Jun 15, 2025

Sora prompted by THE DECODER

Key Points

Salesforce has launched CRMArena-Pro, a benchmark designed to evaluate AI agents in practical business situations, including multi-step conversations and data protection checks within CRM systems.
Leading models like Gemini 2.5 Pro succeed in just 58 percent of straightforward tasks, and their accuracy drops to 35 percent in extended dialogues, mainly because they often miss key questions.
Awareness of data protection is weak in large language models; only special instructions improve the detection of sensitive information, but this improvement comes at the cost of lower overall task performance.

Salesforce's new CRMArena-Pro benchmark reveals major challenges for AI agents in business contexts. Even top models like Gemini 2.5 Pro manage just a 58 percent success rate on single turns. When the dialog gets longer, performance drops to 35 percent.

CRMArena-Pro is designed to test how well large language models (LLMs) can function as agents in real-world business settings, especially for CRM tasks like sales, customer service, and pricing. The benchmark builds on the original CRMArena, adding more business functions, multi-turn dialogs, and tests for data privacy. Using synthetic data inside a Salesforce org, the team created 4,280 task instances across 19 types of business activities and three data protection categories.

Success rate plummets with longer dialogs

The results highlight the limits of today's LLMs. In simple, single-turn tasks, even advanced models like Gemini 2.5 Pro top out at about 58 percent accuracy. But as soon as the system needs to handle multi-turn conversations - asking questions to fill in missing details - performance falls to just 35 percent.

Salesforce ran extensive tests with nine LLMs and found that most models struggle to ask the right follow-up questions. In a review of 20 failed multi-turn tasks with Gemini 2.5 Pro, nearly half failed because the model didn't ask for crucial information. Models that ask more questions perform better in these scenarios.

Task completion rates (%) of LLMs on CRMArena-Pro in B2B/B2C for four skills in single/multi-turn settings — On CRMArena-Pro, Gemini 2.5 Pro usually posts the highest task completion rates for both B2B and B2C scenarios in single- and multi-turn dialogs. OpenAI's direct competitor o3(-pro) was not included in the evaluation. | Image: Salesforce AI Research

The best results came in workflow automation, such as routing customer service cases, where Gemini 2.5 Pro managed an 83 percent success rate. But accuracy dropped sharply for tasks that required understanding text or following rules, like spotting invalid product configurations or pulling information from call logs.

A previous study by Salesforce and Microsoft found similar issues: Even the most advanced LLMs became much less reliable as conversations grew longer and users revealed their needs in stages, with performance dropping by an average of 39 percent in these multi-turn scenarios.

Data privacy remains an afterthought

The benchmark also exposes gaps in data privacy. By default, LLMs almost never recognize or refuse requests for sensitive information, such as personal details or internal company data.

Only by tweaking the system prompt to explicitly reference privacy guidelines did models start to reject these requests, but at the expense of their overall performance. For example, GPT-4o increased its detection of confidential data from zero to 34.2 percent, but its task completion rate dropped by 2.7 points.

Open-source models like LLaMA-3.1 were even less responsive to prompt adjustments, suggesting they need better training to follow prioritized instructions.

Kung-Hsiang Steeve Huang, one of the authors, notes that data protection tests have rarely been included in benchmarks until now. CRMArena-Pro is the first systematic effort to measure this dimension.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: Code | Dataset | Paper | Salesforce