BigCode's open-source code AI aims to solve copyright issues

Update

Programmer and lawyer Matthew Butterick has filed a lawsuit against Microsoft, GitHub, and OpenAI, claiming that GitHub Copilot reproduces developers' code snippets without attribution, in violation of open source licenses. OpenAI's AI model Codex is the basis for Copilot.
The lawsuit seeks nine billion US dollars in damages. The figure is largely symbolic and is extrapolated from estimated license violations; Butterick says he is primarily concerned with protecting the open source community. In his view, it is being hijacked by programming tools such as Copilot and monetized without permission.
Butterick filed the lawsuit in the federal district court for Northern California. The law firm involved is looking for more people who believe Copilot has wronged them.

Code-generating AI systems aim to speed up programming. A new dataset forms the basis for an open-source code AI.

AI startup Hugging Face and ServiceNow Research recently announced "BigCode," a new project for an open-source code AI. The two companies emphasize "open and responsible" development.

Bigger than OpenAI Codex, smaller than DeepMind AlphaCode

As a first step, BigCode aims to provide a dataset for training an open-source code AI with 15 billion parameters.

OpenAI's Codex model, the basis of Microsoft's GitHub Copilot, has about 12 billion parameters. DeepMind's AlphaCode, which has not yet been published, has 41.4 billion parameters and is said to be capable of human-level programming.

ServiceNow wants to use its GPU cluster for AI training. An adapted version of Nvidia's large Transformer language model Megatron serves as the basis. The project is looking for support from AI researchers on the following topics:

A representative evaluation suite for code LLMs, covering a diverse set of tasks and programming languages
Responsible data governance and development for code LLMs
Faster training and inference methods for LLMs

BigCode wants to tackle the copyright problem of code AIs

BigCode aims to avoid one main criticism of Codex and AlphaCode: OpenAI's and DeepMind's models are trained on code examples from the Internet, some of which are copyrighted or at least not explicitly licensed for training an AI.

As with art and text AIs, this can prompt protests from groups who feel ignored or professionally threatened by AI generation. For example, Codex once accurately replicated entire sections of code from an old video game by star developer John Carmack.

Developer and lawyer Matthew Butterick is currently investigating with a team whether and to what extent Copilot violates licensing terms, and is considering litigation. He sees Copilot as a more convenient way to access open source code, but one that ignores common open source licensing terms and thus harms the community.

BigCode wants to ensure copyright clarity from the start: All examples used for AI training must be under the Apache 2.0 license. The generated code is also under the Apache 2.0 license. In individual cases, it is also possible to provide code under alternative licenses.


The current training dataset, "The Stack," contains more than three terabytes of permissively licensed source code files in 30 programming languages crawled from GitHub, according to the project. Developers who discover unauthorized or unwanted code in the Stack dataset can submit a removal request.
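
The licensing rule and the removal requests described above can be pictured as a simple filter over candidate files. The following Python sketch is only an illustration under stated assumptions: the `SourceFile` record, its fields, the `ALLOWED_LICENSES` set, and the `filter_training_files` helper are hypothetical names, not part of the BigCode project, and the project's actual pipeline may work quite differently.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Hypothetical allow-list; the article names Apache 2.0 as the required license.
ALLOWED_LICENSES: Set[str] = {"apache-2.0"}


@dataclass(frozen=True)
class SourceFile:
    """Illustrative record for one crawled file (not a BigCode data format)."""
    repo: str      # repository identifier, e.g. "owner/name"
    license: str   # license identifier detected for the repository
    content: str   # raw source text


def filter_training_files(
    files: Iterable[SourceFile],
    opted_out_repos: Set[str],
    allowed: Set[str] = ALLOWED_LICENSES,
) -> List[SourceFile]:
    """Keep files with an allowed license whose repository has not asked for removal."""
    kept: List[SourceFile] = []
    for f in files:
        if f.license.lower() not in allowed:
            continue  # license is missing or not on the allow-list
        if f.repo in opted_out_repos:
            continue  # honor a removal request from the repository owner
        kept.append(f)
    return kept


if __name__ == "__main__":
    sample = [
        SourceFile("alice/tool", "Apache-2.0", "print('hi')"),
        SourceFile("bob/lib", "GPL-3.0", "int main() {}"),
        SourceFile("carol/app", "apache-2.0", "fn main() {}"),
    ]
    # Assume carol/app submitted a removal request.
    for f in filter_training_files(sample, opted_out_repos={"carol/app"}):
        print(f.repo)
```

In this sketch, license matching is a case-insensitive comparison against the allow-list, and removal requests are applied per repository; both are simplifications chosen for brevity.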

GitHub CEO Thomas Dohmke expects up to 80 percent of code to be written by AI systems in the next five years. Developers using Copilot can reportedly complete tasks about 55 percent faster.
