Summary

BioCoder is a benchmark designed to support the development of AI models for bioinformatics.


Researchers at Yale University and Google DeepMind have introduced BioCoder, a benchmark that tests how well AI models generate bioinformatics-specific code. As the capabilities of ChatGPT and specialized code models grow, these models will be used for increasingly complex tasks, the team says.

Generating functional programs in bioinformatics is a significant challenge because of the domain knowledge required, the need for complex data operations, and the intricate functional dependencies between operations, the team says.

BioCoder is designed to help test these capabilities - and thus support the development of such models. The benchmark includes 2,269 coding problems and integrates real-world challenges such as dependencies, imports, and global variables to better explore the pragmatic coding capabilities of AI models.


It is based on 1,026 functions and 1,243 methods in Python and Java, all from bioinformatics GitHub repositories that are part of peer-reviewed publications. From these, the team created coding problems with prompts, context, and example solutions.
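To make the idea of a problem with prompt, context, and example solution more concrete, here is a minimal sketch in Python. It is not BioCoder's actual data format or test harness: the `BenchmarkProblem` class, its field names, the `passes` helper, and the GC-content task are all hypothetical and chosen only for illustration. The sketch assumes a candidate counts as correct if it returns the same results as the example solution on a few test inputs.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkProblem:
    """Hypothetical layout of one benchmark item (not BioCoder's real schema)."""
    prompt: str       # natural-language task description shown to the model
    context: str      # imports, globals, and helpers the solution depends on
    reference: str    # example solution taken from the source repository
    entry_point: str  # name of the function to call


def passes(problem, candidate, test_inputs):
    """Return True if the candidate matches the reference on every test input."""
    def run(source):
        namespace = {}
        # Context and solution share one namespace, so globals stay visible.
        exec(problem.context + "\n" + source, namespace)
        func = namespace[problem.entry_point]
        return [func(*args) for args in test_inputs]

    try:
        return run(candidate) == run(problem.reference)
    except Exception:
        # Any crash in either run counts as a failed attempt.
        return False


problem = BenchmarkProblem(
    prompt="Return the GC content of a DNA sequence as a fraction.",
    context="BASES_GC = frozenset('GC')",
    reference=(
        "def gc_content(seq):\n"
        "    return sum(b in BASES_GC for b in seq.upper()) / len(seq)\n"
    ),
    entry_point="gc_content",
)

# A model-generated answer that ignores the global but computes the same value.
candidate = (
    "def gc_content(seq):\n"
    "    seq = seq.upper()\n"
    "    return (seq.count('G') + seq.count('C')) / len(seq)\n"
)

result = passes(problem, candidate, [("ATGC",), ("ggcc",), ("ATAT",)])
print(result)
```

The example runs generated code with `exec` purely for brevity. A real evaluation setup would need to isolate untrusted code, for example in a sandbox or container.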

ChatGPT currently leads the BioCoder benchmark

BioCoder was used to test InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. OpenAI's GPT-3.5 Turbo beat the other code generators so handily that the team calls the gap "surprising". "This stark contrast underscores the crucial role of both the dataset size and parameter size of the base models in accomplishing closed-domain code generation prompts," the team says.

In one experiment, however, the team was able to improve StarCoder's performance through fine-tuning. Success in specialized domains such as bioinformatics is therefore possible not only with large language models such as ChatGPT, but also with smaller, specialized models, the team says. In the future, the team plans to test other open models, such as Meta's Llama 2, and expects improvements from models with longer context lengths.

Even so, BioCoder remained a challenge for ChatGPT: the model achieved an accuracy of just under 50 percent. GPT-4 has not yet been tested.

More information, benchmarks, code, and data are available on GitHub.

  • Researchers from Yale University and Google DeepMind present BioCoder, a benchmark to support AI model development in bioinformatics. The benchmark includes 2,269 coding problems and incorporates real-world challenges such as dependencies and global variables.
  • In tests with several code generators, including InCoder, CodeGen, SantaCoder, and ChatGPT, OpenAI's GPT-3.5 Turbo showed the most convincing results - a lead over the other models that the research team called "surprising".
  • The team plans to explore other open models, such as Meta's Llama 2, in future tests.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.