Starcoder is a performant open-source model for copyright-compliant code

BigCode, a joint initiative of Hugging Face and ServiceNow, introduces Starcoder and StarcoderBase, two large open-source code language models. The researchers place special emphasis on transparent and copyright-compliant data selection.

The 15.5 billion parameter Starcoder models can generate code in 86 programming languages. They use a technique called "multi-query attention," in which all attention heads share a single set of keys and values instead of each head keeping its own. This reduces the memory needed during generation and, together with an 8K-token context window, lets both Starcoder models read larger amounts of code faster and more efficiently, speeding up code understanding and code generation.
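To give a rough sense of why multi-query attention helps with long contexts: the technique is generally described as sharing one key/value head across all query heads, so the cache of keys and values kept during generation shrinks by a factor equal to the number of heads. The sketch below is back-of-the-envelope arithmetic only. The layer count, head count, head dimension, and 16-bit precision are illustrative assumptions and are not figures from this article; only the 8K context length comes from the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate size of the key/value cache for one sequence.

    The leading factor 2 accounts for storing one key tensor and one
    value tensor in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen for illustration only.
N_LAYERS = 40
N_QUERY_HEADS = 48
HEAD_DIM = 128
SEQ_LEN = 8192  # the 8K context window mentioned in the article

# Standard multi-head attention: every query head has its own keys/values.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# Multi-query attention: all query heads share a single key/value head.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head cache:  {mha / 2**20:.0f} MiB")
print(f"multi-query cache: {mqa / 2**20:.0f} MiB")
print(f"reduction factor:  {mha // mqa}x")
```

Under these assumptions the saving scales directly with the number of query heads, which is why the technique matters more as context length and batch size grow.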

According to participating researcher Lubna Ben Allal, the Starcoder models were trained on heavily curated data, which meant a lot of human effort: "We manually inspected 50–100 files for all the extensions in the selected programming languages and choose adequate filters," Ben Allal said.

The work seems to have paid off: Both models outperform every other open model that supports multiple programming languages in benchmarks, and they match or surpass OpenAI's "code-cushman-001" model.

This adds Starcoder to the growing list of open-source AI models that can compete with proprietary commercial models, although Starcoder's coding performance may still lag behind GPT-4.

Starcoder team respects privacy and copyrights

Both models also aim to set a new standard in data governance. The team says it trained only on permissively licensed data stripped of personal information. It has also implemented an opt-out mechanism and a code snippet search tool so that developers can check whether their code is included in the training data drawn from The Stack dataset.

The team releases the Starcoder model under the OpenRAIL-M license (Open Responsible AI License for models), which permits commercial use. The model is not instruction-tuned out of the box, but it can be prompted to act as a technical assistant. Further information and links are available on the Starcoder page at Hugging Face.
