Microsoft researchers have developed RUBICON, a technique to automatically assess the quality of conversations between software developers and AI assistants. The system generates tailored evaluation criteria.

Evaluating AI assistants like GitHub Copilot is challenging for tool developers because the quality of human-AI interactions is difficult to gauge due to the variety of tasks and complexity of conversations.

Microsoft researchers now present RUBICON, a technique to automatically assess the quality of such domain-specific conversations. RUBICON stands for "Rubric-based Evaluation of Domain-specific Human-AI Conversations" and was presented at the 2024 AIware conference.

Two similar conversations with the debugging AI assistant, evaluated against representative rubrics. The conversation on the right was rated higher than the one on the left. | Image: Microsoft

The system consists of three main components: generating evaluation criteria, selecting the most relevant criteria, and the actual assessment of conversations. To generate criteria, RUBICON first analyzes a training dataset of conversations labeled as positive or negative. It then identifies patterns indicating user satisfaction or dissatisfaction.
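The first stage can be pictured as follows. This is a minimal sketch, not Microsoft's implementation: the function `suggest_rubrics` is a hypothetical stand-in for the LLM call that proposes rubric items explaining why a conversation was labeled positive or negative.

```python
def suggest_rubrics(conversation, label):
    """Hypothetical stand-in for an LLM call: propose rubric items that
    explain why a conversation was labeled positive or negative."""
    if label == "positive":
        return ["The assistant proposed a concrete fix the user could apply."]
    return ["The assistant asked the user to repeat information already given."]

def generate_candidates(labeled_conversations):
    """Collect de-duplicated rubric candidates over the labeled training set."""
    candidates = []
    for convo, label in labeled_conversations:
        for rubric in suggest_rubrics(convo, label):
            if rubric not in candidates:
                candidates.append(rubric)
    return candidates

# Toy training set of two labeled debugging conversations.
train = [("User: NullReferenceException\nAI: Check where the variable is set.", "positive"),
         ("User: Build fails.\nAI: What is the error?", "negative")]
print(generate_candidates(train))
```

In the real system this stage would produce many candidate rubrics per conversation; the de-duplicated pool is what the next stage selects from.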

Better coding AI

Unlike previous approaches, RUBICON incorporates principles of effective communication, such as Grice's Conversational Maxims (which capture four dimensions of conversational effectiveness: quantity, quality, relevance, and manner), and domain-specific knowledge.

This tailors the generated criteria to the specific application domain. In a second step, RUBICON uses an iterative process to select the subset of generated criteria that best distinguishes between positive and negative conversations. Finally, a large language model scores the conversations under test against the selected criteria, and a score threshold determines the final classification.

The researchers evaluated RUBICON using 100 conversations between developers and an AI assistant for debugging in C#. The results showed that the criteria generated by RUBICON allowed for a significantly better distinction between positive and negative conversations than criteria from previous methods or manually created criteria.

With RUBICON, 84% of conversations could be classified as positive or negative with a precision of over 90%. Previous methods achieved a maximum of 64%.
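Those two numbers — 84% coverage at over 90% precision — suggest a scheme that only commits to a label when the rubric score is confident enough and abstains otherwise. The sketch below illustrates the idea with invented thresholds and scores; the actual threshold values are not given in the article.

```python
def classify(score, low=0.35, high=0.65):
    """Label a conversation 'pos' or 'neg' only when the rubric score is
    confident enough; return None (abstain) in the uncertain middle band."""
    if score >= high:
        return "pos"
    if score <= low:
        return "neg"
    return None

# Invented rubric scores for eight conversations.
scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2, 0.1, 0.55]
labels = [classify(s) for s in scores]
coverage = sum(l is not None for l in labels) / len(labels)
print(labels, coverage)
```

Widening the abstention band trades coverage for precision: the fewer borderline conversations the system labels, the more reliable the labels it does emit.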

According to Microsoft, RUBICON has already been successfully used in a popular development environment of a large software company to monitor two AI assistants.

Summary
  • Microsoft researchers have developed RUBICON, a technology that automatically evaluates the quality of conversations between software developers and AI assistants by generating customized criteria, selecting the most relevant ones and evaluating the conversations based on them.
  • RUBICON incorporates principles for effective communication such as Grice's conversation maxims and domain-specific knowledge to tailor the criteria to the respective application domain.
  • In an evaluation of debugging conversations in C#, RUBICON was able to classify 84% of the conversations as positive or negative with a precision of over 90%, which significantly outperforms previous methods and underlines the importance of domain-specific knowledge and communication principles.
Kim is a regular contributor to THE DECODER. He focuses on the ethical, economic, and political implications of AI.