BioCoder is a benchmark for AI-generated bioinformatics code

BioCoder is a benchmark designed to support the development of AI models for bioinformatics.

Researchers at Yale University and Google DeepMind have introduced BioCoder, a benchmark that tests how well AI models generate bioinformatics-specific code. As the capabilities of ChatGPT and specialized code models grow, these models will be used for increasingly complex tasks, the team says.

Generating functional bioinformatics programs is a significant challenge because of the amount of domain knowledge required, the need for complicated data operations, and the intricate functional dependencies between operations, the team says.

BioCoder is designed to help test these capabilities and thus support the development of such models. The benchmark includes 2,269 coding problems and integrates real-world challenges such as dependencies, imports, and global variables to better explore the practical coding capabilities of AI models.

It is based on 1,026 functions and 1,243 methods in Python and Java, all taken from bioinformatics GitHub repositories that are part of peer-reviewed publications. From these, the team created coding problems with prompts, context, and example solutions.
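To make the idea of a problem with a prompt, context, and example solution more concrete, here is a minimal Python sketch. It is not BioCoder's actual data schema or test harness, which this article does not describe. The `Problem` class, its field names, the toy GC-content task, and the `passes` and `pass_rate` helpers are all hypothetical and chosen only for illustration. The sketch assumes a candidate counts as correct when it returns the same result as the example solution on a handful of inputs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Problem:
    """Hypothetical problem record; field names are illustrative only."""
    prompt: str          # natural-language description of the task
    context: str         # imports, globals, and helpers the function relies on
    reference: Callable  # example solution used as ground truth


def gc_content(seq: str) -> float:
    """Example solution for a toy task: fraction of G/C bases in a DNA string."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def passes(problem: Problem, candidate: Callable, inputs: Sequence[str]) -> bool:
    """Return True if the candidate agrees with the example solution on every input."""
    for value in inputs:
        try:
            if candidate(value) != problem.reference(value):
                return False
        except Exception:
            # A crash in generated code counts as a failure.
            return False
    return True


def pass_rate(results: Sequence[bool]) -> float:
    """Share of candidates that passed."""
    return sum(results) / len(results)


problem = Problem(
    prompt="Return the fraction of G and C bases in a DNA sequence.",
    context="# no extra imports or globals needed for this toy task",
    reference=gc_content,
)
inputs = ["ACGT", "ggcc", ""]


def good_candidate(seq: str) -> float:
    # Handles lowercase input and the empty string.
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(seq) if seq else 0.0


def buggy_candidate(seq: str) -> float:
    # Forgets to normalize case, so "ggcc" is scored incorrectly.
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


results = [passes(problem, c, inputs) for c in (good_candidate, buggy_candidate)]
print(results, pass_rate(results))
```

In this toy setup, the second candidate fails because it ignores lowercase input. A real benchmark would add a lot on top of this, such as sandboxed execution and many more test inputs.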

ChatGPT currently leads the BioCoder benchmark

BioCoder was used to test InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. OpenAI's GPT-3.5 Turbo beat the other code generators so handily that the team calls the gap "surprising". "This stark contrast underscores the crucial role of both the dataset size and parameter size of the base models in accomplishing closed-domain code generation prompts," the team says.

In one experiment, however, the team was able to improve StarCoder's performance through fine-tuning. Success in specialized domains such as bioinformatics is therefore possible not only with large language models such as ChatGPT, but also with smaller, specialized models, the team says. In the future, the researchers plan to test other open models, such as Meta's Llama 2, and expect improvements from models with longer context lengths.

Even so, BioCoder remains a challenge for ChatGPT: the model achieved an accuracy of just under 50 percent. GPT-4 has not yet been tested.

More information, benchmarks, code, and data are available on GitHub.
