Microsoft's RUBICON tells if your AI coding buddy is actually helping or just slacking off


Microsoft researchers have developed RUBICON, a technique to automatically assess the quality of conversations between software developers and AI assistants. The system generates tailored evaluation criteria.

Evaluating AI assistants like GitHub Copilot is challenging for tool developers because the quality of human-AI interactions is difficult to gauge due to the variety of tasks and complexity of conversations.

Microsoft researchers now present RUBICON, a technique to automatically assess the quality of such domain-specific conversations. RUBICON stands for "Rubric-based Evaluation of Domain Specific Human-AI Conversations" and was presented at the AIware conference 2024.

Two analogous conversations conducted by the debugger AI assistant were evaluated using some representative rubrics. The conversation on the right was rated as better, the one on the left as worse. | Image: Microsoft

The system consists of three main components: generating evaluation criteria, selecting the most relevant criteria, and the actual assessment of conversations. To generate criteria, RUBICON first analyzes a training dataset of conversations labeled as positive or negative. It then identifies patterns indicating user satisfaction or dissatisfaction.

Better Coding-AI

Unlike previous approaches, RUBICON incorporates principles of effective communication, such as Grice's Conversational Maxims (which capture four dimensions of conversational effectiveness: quantity, quality, relevance, and manner), and domain-specific knowledge.

This tailors the generated criteria to the specific application domain. In a second step, RUBICON uses an iterative process to select a subset of the generated criteria that best distinguish between positive and negative conversations. Finally, a large language model evaluates the conversations to be tested based on the selected criteria and a determined threshold.
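To make the selection and scoring steps more concrete, here is a minimal Python sketch. It is not Microsoft's implementation: the greedy search, the "separation" measure, the rubric names, and the sample scores are all assumptions chosen for illustration. In the real system, a large language model would produce the per-rubric scores.

```python
from statistics import mean

# Hypothetical labeled conversations: (is_positive, {rubric: score in [0, 1]}).
# The scores are made up; an LLM would normally assign them.
conversations = [
    (True,  {"resolves_bug": 0.9, "concise": 0.6, "polite": 0.8}),
    (True,  {"resolves_bug": 0.8, "concise": 0.7, "polite": 0.7}),
    (False, {"resolves_bug": 0.2, "concise": 0.5, "polite": 0.8}),
    (False, {"resolves_bug": 0.1, "concise": 0.6, "polite": 0.7}),
]

def net_score(scores, rubrics):
    """Average score of one conversation over the chosen rubrics."""
    return mean(scores[r] for r in rubrics)

def separation(data, rubrics):
    """Gap between the mean score of positive and negative conversations."""
    pos = [net_score(s, rubrics) for label, s in data if label]
    neg = [net_score(s, rubrics) for label, s in data if not label]
    return mean(pos) - mean(neg)

def select_rubrics(data, candidates, k):
    """Greedily add the rubric that most widens the gap, up to k rubrics."""
    chosen = []
    while len(chosen) < k:
        remaining = [r for r in candidates if r not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda r: separation(data, chosen + [r]))
        # Stop once an extra rubric no longer improves the separation.
        if chosen and separation(data, chosen + [best]) <= separation(data, chosen):
            break
        chosen.append(best)
    return chosen

def classify(scores, rubrics, threshold):
    """Label a conversation positive if its net score reaches the threshold."""
    return net_score(scores, rubrics) >= threshold

selected = select_rubrics(conversations, ["resolves_bug", "concise", "polite"], k=2)
print(selected)
print(classify(conversations[0][1], selected, threshold=0.5))
```

In this toy data, a rubric such as "polite" scores about the same for good and bad conversations, so it adds nothing to the distinction and is left out. That is the intuition behind keeping only the criteria that best separate the two groups.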

The researchers evaluated RUBICON using 100 conversations between developers and an AI assistant for debugging in C#. The results showed that the criteria generated by RUBICON allowed for a significantly better distinction between positive and negative conversations than criteria from previous methods or manually created criteria.

With RUBICON, 84% of conversations could be classified as positive or negative with a precision of over 90%. Previous methods achieved a maximum of 64%.
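One way to read these figures is as a trade-off between how many conversations get a verdict and how often that verdict is right. The Python sketch below illustrates that reading under stated assumptions: the two cut-offs `low` and `high`, the function name, and the sample scores are hypothetical and do not come from the paper.

```python
def coverage_and_precision(scored, low, high):
    """scored: list of (score, is_positive) pairs.

    Scores >= high are called positive, scores <= low negative, and
    anything in between is left unclassified (assumed abstention band).
    Returns (share of conversations classified, share of those correct).
    """
    decided = [(s >= high, label) for s, label in scored if s >= high or s <= low]
    if not decided:
        return 0.0, 0.0
    correct = sum(1 for predicted, label in decided if predicted == label)
    return len(decided) / len(scored), correct / len(decided)

# Made-up scores for ten conversations, not data from the study.
sample = [
    (0.90, True), (0.80, True), (0.75, False), (0.50, True), (0.20, False),
    (0.10, False), (0.15, False), (0.85, True), (0.30, True), (0.95, True),
]

coverage, precision = coverage_and_precision(sample, low=0.3, high=0.7)
print(f"classified: {coverage:.0%}, precision: {precision:.0%}")
```

Widening the unclassified band between `low` and `high` typically raises precision at the cost of coverage, which is why the two numbers are reported together.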

According to Microsoft, RUBICON has already been successfully used in a popular development environment of a large software company to monitor two AI assistants.
