BigCode, a joint initiative of Hugging Face and ServiceNow, introduces StarCoder and StarCoderBase, two large open-source code language models. The researchers place special emphasis on transparent and copyright-compliant data selection.

The 15.5-billion-parameter StarCoder models can generate code in 86 programming languages. The researchers used a technique called "multi-query attention," in which all attention heads share a single set of keys and values instead of each head keeping its own. This reduces the memory the models need during generation and, together with an 8K-token context window, lets both StarCoder models read larger amounts of code faster and more efficiently, speeding up code understanding and code generation.
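
A back-of-the-envelope calculation can make the memory saving concrete. The sketch below compares the size of the key/value cache for conventional multi-head attention against multi-query attention at an 8K-token context. The layer count, head count, head dimension, and 16-bit precision are illustrative assumptions for a model of roughly this size, not figures taken from this article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: each layer stores one tensor of keys and one of values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative dimensions (not stated in the article).
N_LAYERS = 40
N_QUERY_HEADS = 48
HEAD_DIM = 128
SEQ_LEN = 8192  # 8K-token context window

# Multi-head attention: every query head has its own keys and values.
mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# Multi-query attention: all query heads share one set of keys and values.
mqa_bytes = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head cache:  {mha_bytes / 2**30:.2f} GiB")
print(f"multi-query cache: {mqa_bytes / 2**30:.2f} GiB")
print(f"reduction factor:  {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads, which is why sharing keys and values matters most for long contexts and large batches.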

According to participating researcher Loubna Ben Allal, the StarCoder models were trained on heavily curated data, which meant a lot of human effort: "We manually inspected 50–100 files for all the extensions in the selected programming languages and choose adequate filters," Ben Allal said.

The work seems to have paid off: in benchmarks, both models outperform every other open model that supports multiple programming languages, and they even match or surpass OpenAI's "code-cushman-001" model.

This adds StarCoder to the growing list of open-source AI models that can compete with proprietary AI models, although StarCoder's coding performance may still lag behind GPT-4.

StarCoder team respects privacy and copyrights

Both models also aim to set a new standard in data governance. The team says it trained only on permissively licensed data with personal information removed. It has also implemented an opt-out mechanism and a code snippet search engine that lets you check whether your code is included in the training data drawn from The Stack dataset.

The team releases the StarCoder models under an Open Responsible AI License (OpenRAIL-M), which permits commercial use. The models are not instruction-tuned out of the box, but can be prompted to act as a technical assistant with some additional instructions. All further information and links can be found on the Hugging Face StarCoder page.
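
To illustrate what "additional instructions" can mean for a base model, here is a minimal sketch of a dialogue-style prompt builder. The function name, the "Human:"/"Assistant:" labels, and the preamble wording are hypothetical placeholders chosen for this example. They are not the prompt published by the BigCode team.

```python
def build_assistant_prompt(preamble, examples, question):
    """Assemble a dialogue-style prompt for a model that is not instruction-tuned.

    The model is expected to continue the text after the final "Assistant:" label.
    """
    parts = [preamble.strip(), ""]
    # A few worked exchanges show the model the expected format and tone.
    for human_turn, assistant_turn in examples:
        parts.append(f"Human: {human_turn}")
        parts.append(f"Assistant: {assistant_turn}")
        parts.append("")
    # The real question goes last, with the assistant's turn left open.
    parts.append(f"Human: {question}")
    parts.append("Assistant:")
    return "\n".join(parts)


prompt = build_assistant_prompt(
    preamble="Below is a conversation between a human and a helpful technical assistant.",
    examples=[
        ("How do I reverse a list in Python?",
         "Use `my_list[::-1]` for a reversed copy, or `my_list.reverse()` to reverse in place."),
    ],
    question="How do I read a file line by line?",
)
print(prompt)
```

The resulting string would be passed to the model as ordinary input text. Generation should then be cut off at the next "Human:" label so the model does not go on to write both sides of the conversation.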

Summary
  • BigCode introduces StarCoder and StarCoderBase, powerful open-source code language models that work in 86 programming languages.
  • The models use "multi-query attention" for more efficient code processing.
  • The team is committed to privacy and copyright compliance, and releases the models under a license that permits commercial use.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.