BigCode's open-source code AI aims to solve copyright issues

Matthias Bastian

Update
  • Programmer and lawyer Matthew Butterick has filed a lawsuit against Microsoft, GitHub, and OpenAI, claiming that GitHub Copilot reproduces developers' code snippets without attribution, in violation of open-source licenses. OpenAI's AI model Codex is the basis for Copilot.
  • The lawsuit seeks nine billion US dollars in damages. The figure is largely symbolic and results from extrapolating license violations; Butterick says he is primarily concerned with protecting the open-source community. In his view, it is being hijacked by programming tools such as Copilot and monetized without permission.
  • Butterick filed the lawsuit in the federal district court for the Northern District of California. The law firm is looking for more people who feel wronged by Copilot.

Code-generating AI systems aim to speed up programming. A new dataset forms the basis for an open-source code AI.

AI startup Hugging Face and ServiceNow Research recently announced "BigCode," a new project for an open-source code AI. The two companies emphasize "open and responsible" development.

Bigger than OpenAI Codex, smaller than DeepMind AlphaCode

As a first step, BigCode aims to provide a dataset for training an open-source code AI with 15 billion parameters.

OpenAI's Codex model, the basis of Microsoft's GitHub Copilot, has about 12 billion parameters. DeepMind's AlphaCode, which has not yet been released, has 41.4 billion parameters and is said to be capable of human-level programming.

ServiceNow wants to use its GPU cluster for AI training. An adapted version of Megatron, Nvidia's large Transformer language model, serves as the basis. The project is also looking for support from AI researchers.

BigCode wants to tackle the copyright problem of code AIs

BigCode aims to avoid one main criticism of Codex and AlphaCode: OpenAI's and DeepMind's models are trained on code examples from the Internet, some of which are copyrighted or at least not explicitly licensed for training an AI.

As with art and text AIs, this can lead to protests from groups who feel ignored or professionally threatened by AI generation. For example, Codex once accurately replicated entire sections of code from an old video game by star developer John Carmack.

Developer and lawyer Matthew Butterick and his team are currently investigating whether, and to what extent, Copilot violates licensing terms, with a view to litigation. He sees Copilot as a more convenient way to access open-source code, but one that ignores common open-source licensing terms and thus harms the community.

BigCode wants to ensure copyright clarity from the start: All examples used for AI training must be under the Apache 2.0 license. The generated code is also under the Apache 2.0 license. In individual cases, it is also possible to provide code under alternative licenses.

The current training dataset, "The Stack," contains more than three terabytes of licensed source code files in 30 programming languages, crawled from GitHub, according to the project. Developers who discover unauthorized or unwanted code in The Stack can submit a removal request.
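To make the idea concrete, here is a minimal Python sketch of how a training set could be restricted to one license and how removal requests could be honored. It is not BigCode's actual pipeline: the `SourceFile` record, the `filter_training_files` function, the sample repository names, and the assumption that each file carries a license identifier are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical record for one crawled file (not a real BigCode schema)."""
    repo: str
    path: str
    license: str  # license identifier, e.g. "Apache-2.0"


# Assumption: the allow-list mirrors the Apache 2.0 requirement described above.
ALLOWED_LICENSES = frozenset({"apache-2.0"})


def filter_training_files(files, allowed_licenses=ALLOWED_LICENSES,
                          removed_repos=frozenset()):
    """Keep files with an allowed license whose repository has not opted out."""
    allowed = {name.lower() for name in allowed_licenses}
    removed = set(removed_repos)
    return [
        f for f in files
        if f.license.lower() in allowed and f.repo not in removed
    ]


# Made-up sample data for illustration only.
candidates = [
    SourceFile("alice/utils", "strings.py", "Apache-2.0"),
    SourceFile("bob/engine", "render.c", "GPL-2.0"),
    SourceFile("carol/tool", "main.go", "Apache-2.0"),
]

# "carol/tool" stands in for a repository whose owner submitted a removal request.
kept = filter_training_files(candidates, removed_repos={"carol/tool"})
print([f.repo for f in kept])
```

In this sketch, the license check and the opt-out check are applied together, so a file must pass both to remain in the training set.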

GitHub CEO Thomas Dohmke expects up to 80 percent of code to be written by AI systems within the next five years. Developers using Copilot can allegedly complete tasks about 55 percent faster.