Update
  • Programmer and lawyer Matthew Butterick has filed a lawsuit against Microsoft, GitHub, and OpenAI, claiming that GitHub Copilot reproduces developers' code snippets without attribution, in violation of open-source licenses. Copilot is based on OpenAI's Codex model.
  • The lawsuit seeks nine billion US dollars in damages, a largely symbolic figure extrapolated from license violations. Butterick says he is primarily concerned with protecting the open-source scene, which in his view is being hijacked by programming tools such as Copilot and monetized without permission.
  • Butterick filed the lawsuit in the federal district court for Northern California. His law firm is looking for more developers who believe they have been harmed by Copilot.

Code-generating AI systems aim to speed up programming. A new dataset forms the basis for an open-source code AI.

AI startup Hugging Face and ServiceNow Research recently announced "BigCode," a new project for an open-source code AI. The two companies emphasize "open and responsible" development.

Bigger than OpenAI Codex, smaller than Deepmind AlphaCode

As a first step, BigCode aims to provide a dataset for training an open-source code AI with 15 billion parameters.

OpenAI's Codex model, the basis of Microsoft's Github Copilot, has about 12 billion parameters. Deepmind's AlphaCode, which has not yet been published, has 41.4 billion parameters and is said to be capable of human-level programming.


ServiceNow wants to use its GPU cluster for AI training. An adapted version of Nvidia's large Transformer language model Megatron serves as the basis. The project is looking for support from AI researchers on the following topics:

  • A representative evaluation suite for code LLMs, covering a diverse set of tasks and programming languages
  • Responsible data governance and development for code LLMs
  • Faster training and inference methods for LLMs

BigCode wants to tackle the copyright problem of code AIs

BigCode aims to avoid one main criticism of Codex and AlphaCode: OpenAI and Deepmind's models are trained using code examples from the Internet, some of which are copyrighted or at least not explicitly licensed for training an AI.

As with image and text AIs, this can provoke protests from groups who feel ignored or professionally threatened by AI generation. For example, Codex once reproduced entire sections of code from an old video game by star developer John Carmack almost verbatim.

Developer and lawyer Matthew Butterick is currently investigating with a team whether and to what extent Copilot violates licensing terms and is seeking litigation. He sees Copilot as a more convenient way to access open source code, but one that ignores common open source licensing terms and thus harms the scene.

BigCode wants to ensure copyright clarity from the start: all examples used for AI training must be under the Apache 2.0 license, and the generated code is also released under Apache 2.0. In individual cases, code can also be provided under alternative licenses.
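The licensing rule described above amounts to filtering a corpus by declared license before training. The sketch below illustrates that idea with SPDX-style license identifiers on hypothetical file records; it is not BigCode's actual pipeline, just a minimal illustration of the constraint.

```python
# Minimal sketch of license-based corpus filtering, as BigCode's Apache 2.0
# requirement implies. The file records and the use of SPDX identifiers are
# illustrative assumptions, not BigCode's real data pipeline.

ALLOWED_LICENSES = {"Apache-2.0"}  # BigCode's stated requirement

def filter_by_license(files):
    """Keep only files whose declared SPDX license identifier is allowed."""
    return [f for f in files if f.get("license") in ALLOWED_LICENSES]

corpus = [
    {"path": "a.py", "license": "Apache-2.0"},
    {"path": "b.py", "license": "GPL-3.0-only"},  # copyleft: excluded
    {"path": "c.py", "license": None},            # unlicensed: excluded
]

kept = filter_by_license(corpus)
print([f["path"] for f in kept])  # ['a.py']
```

Note that files with no detectable license are excluded along with incompatibly licensed ones, which matches the project's stated goal of training only on explicitly licensed code.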


The current training dataset, "The Stack," contains more than three terabytes of licensed source code files in 30 programming languages crawled from GitHub, according to the project. Developers who discover unauthorized or unwanted code in The Stack can submit a removal request.

GitHub CEO Thomas Dohmke expects up to 80 percent of code to be written by AI systems within the next five years. Developers using Copilot reportedly complete tasks about 55 percent faster.

Summary
  • BigCode is a project for an open-source code AI. The model is expected to have more parameters than OpenAI's Codex.
  • The basis for the AI training is the code dataset "The Stack," with more than three terabytes of code examples from GitHub.
  • BigCode places particular emphasis on ensuring that The Stack does not violate any copyrights, an issue currently being debated around OpenAI's Codex and Deepmind's AlphaCode.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.