BigCode, a joint initiative of Hugging Face and ServiceNow, has introduced StarCoder and StarCoderBase, two large open-source code language models. The researchers place special emphasis on transparent and copyright-compliant data selection.
The 15.5-billion-parameter StarCoder models can generate code in 86 programming languages. The researchers used a technique called “multi-query attention,” in which all attention heads share a single set of keys and values instead of each head keeping its own. This shrinks the memory the models need during generation, which helps both StarCoder models handle larger amounts of code (8K-token context windows) faster and more efficiently, speeding up code understanding and code generation.
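To make the idea concrete, here is a minimal pure-Python sketch of causal multi-query attention: every head has its own queries, but all heads read from one shared set of keys and values. It is an illustration under simplifying assumptions (plain lists, no learned projections, no batching), not the StarCoder implementation. The helper `kv_cache_values` and the layer and head counts in the demo are illustrative assumptions, not figures from the article.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention(queries, keys, values):
    """Causal multi-query attention on plain Python lists.

    queries: [n_heads][seq_len][head_dim], one set of queries per head
    keys:    [seq_len][head_dim], shared by every head
    values:  [seq_len][head_dim], shared by every head
    Returns [n_heads][seq_len][head_dim].
    """
    head_dim = len(keys[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for head_queries in queries:
        head_out = []
        for t, q in enumerate(head_queries):
            # Causal mask: position t only sees positions 0..t.
            scores = [
                scale * sum(qi * ki for qi, ki in zip(q, keys[s]))
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            head_out.append([
                sum(w * values[s][d] for s, w in enumerate(weights))
                for d in range(head_dim)
            ])
        outputs.append(head_out)
    return outputs


def kv_cache_values(n_layers, n_kv_heads, head_dim, seq_len):
    """Number of cached key and value entries for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len


if __name__ == "__main__":
    # Illustrative sizes only; they are assumptions, not figures from the article.
    n_layers, n_heads, head_dim, seq_len = 40, 48, 128, 8192
    multi_head = kv_cache_values(n_layers, n_heads, head_dim, seq_len)
    multi_query = kv_cache_values(n_layers, 1, head_dim, seq_len)
    print(f"cache entries, one K/V per head: {multi_head}")
    print(f"cache entries, shared K/V:       {multi_query}")
    print(f"reduction factor:                {multi_head // multi_query}")
```

Because the keys and values are stored once rather than once per head, the cache that must be kept in memory while generating shrinks by a factor equal to the number of heads. That saving is what makes long context windows cheaper to serve.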
According to participating researcher Loubna Ben Allal, the StarCoder models were trained on heavily curated data, which meant a lot of human effort: “We manually inspected 50–100 files for all the extensions in the selected programming languages and choose adequate filters,” Ben Allal said.
The work seems to have paid off: Both models perform better in benchmarks than any other open model that supports multiple programming languages, and even equal or surpass the OpenAI “code-cushman-001” model.
StarCoder outperforms OpenAI's code-cushman-001 and all open code generation models on HumanEval. On other benchmarks like DS-1000 the gap is even larger.
DS-1000 includes more diverse and realistic data science problems spanning 7 libraries.
— Loubna Ben Allal (@LoubnaBenAllal1) May 9, 2023
This adds StarCoder to the growing list of open-source AI models that can compete with proprietary AI models, although StarCoder’s coding performance may still lag behind GPT-4.
StarCoder team respects privacy and copyright
Both models also aim to set a new standard in data governance. The team says it trained only on permissively licensed data with personal information removed. It has also implemented an opt-out mechanism, along with a code snippet search engine that lets you check whether your code is included in the training data drawn from The Stack dataset.
The team is releasing the StarCoder model under the Open Responsible AI Model license (OpenRAIL-M), which permits commercial use. The model is not instruction-tuned out of the box, but it can be prompted to act as a technical assistant with some additional instructions. Further information and links can be found on the Hugging Face StarCoder page.
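One way to read “additional instructions” is as a prompt prefix that frames the input as a dialogue, so that a base model continues the text in the assistant's voice. The sketch below shows that idea in plain Python. The function name `build_assistant_prompt`, the preamble wording, and the “Human:”/“Assistant:” labels are hypothetical placeholders, not the prompt published by the StarCoder team.

```python
def build_assistant_prompt(question, history=()):
    """Wrap a question in a dialogue-style prompt for a base code model.

    The preamble and the "Human:"/"Assistant:" labels are hypothetical
    placeholders, not the prompt published by the StarCoder team.
    """
    preamble = (
        "Below is a conversation between a human and a helpful technical "
        "assistant. The assistant answers programming questions clearly "
        "and admits when it is unsure.\n"
    )
    lines = [preamble]
    # Replay earlier turns so the model sees the conversation so far.
    for past_question, past_answer in history:
        lines.append(f"Human: {past_question}")
        lines.append(f"Assistant: {past_answer}")
    lines.append(f"Human: {question}")
    # End on the assistant label so the model continues with an answer.
    lines.append("Assistant:")
    return "\n".join(lines)
```

The returned string would then be passed to whatever text-generation call you use. In practice you would likely also stop generation when the model starts a new “Human:” turn.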