Update
  • Added new information from the SemiAnalysis report

Update, July 11, 2023:

A new report by SemiAnalysis reveals more details about OpenAI's GPT-4, concluding that "OpenAI is keeping the architecture of GPT-4 closed not because of some existential risk to humanity, but because what they've built is replicable." The details of the report leaked via Twitter and Pastebin and confirm much of the information previously shared by people like George Hotz.

The key points:

  • GPT-4's Scale: GPT-4 has ~1.8 trillion parameters across 120 layers, which is over 10 times larger than GPT-3.
  • Mixture of Experts (MoE): OpenAI uses 16 experts within the model, each with ~111B parameters for the MLP. Two of these experts are routed to per forward pass, which helps keep inference costs manageable (see the rough arithmetic after this list).
  • Dataset: GPT-4 is trained on ~13T tokens, including both text-based and code-based data, with some fine-tuning data from ScaleAI and internally.
  • Dataset Mixture: The training data included CommonCrawl & RefinedWeb, totaling 13T tokens. Speculation suggests additional sources like Twitter, Reddit, YouTube, and a large collection of textbooks.
  • Training Cost: Training GPT-4 cost around $63 million, taking into account the computational power required and the training time.
  • Inference Cost: GPT-4 costs 3 times more than the 175B parameter Davinci, due to the larger clusters required and lower utilization rates.
  • Inference Architecture: The inference runs on a cluster of 128 GPUs, using 8-way tensor parallelism and 16-way pipeline parallelism.
  • Vision Multi-Modal: GPT-4 includes a vision encoder so that autonomous agents can read web pages and transcribe images and videos. The architecture is similar to Flamingo. It adds parameters on top of the text model and is fine-tuned with another ~2 trillion tokens.
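
For readers who want to check the numbers, here is a rough back-of-the-envelope calculation in Python based on the figures above. It covers only the expert parameters and the reported GPU layout; the breakdown of the remaining attention and embedding parameters has not been published.

```python
# Rough arithmetic on the figures reported above. Only the expert share is
# computed, since the split of the non-expert parameters is not public.

NUM_EXPERTS    = 16
EXPERT_PARAMS  = 111e9   # ~111B parameters per expert MLP
EXPERTS_ROUTED = 2       # experts activated per forward pass

total_expert_params  = NUM_EXPERTS * EXPERT_PARAMS     # ~1.78T held in expert MLPs
active_expert_params = EXPERTS_ROUTED * EXPERT_PARAMS  # ~0.22T used per token

print(f"Expert parameters in total:  {total_expert_params / 1e12:.2f}T")
print(f"Expert parameters per token: {active_expert_params / 1e12:.2f}T")
print(f"Fraction of experts active:  {EXPERTS_ROUTED / NUM_EXPERTS:.1%}")

# Reported inference layout: 8-way tensor x 16-way pipeline parallelism
TENSOR_PARALLEL, PIPELINE_PARALLEL = 8, 16
print(f"GPUs per inference cluster:  {TENSOR_PARALLEL * PIPELINE_PARALLEL}")
```

Only about an eighth of the expert weights are touched per token, which is why inference stays affordable despite the enormous total parameter count.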

Original article from June 28, 2023:


OpenAI GPT-4 is said to be based on the Mixture of Experts architecture and has 1.76 trillion parameters.

GPT-4 is rumored to be based on eight models, each with 220 billion parameters, which are linked in the Mixture of Experts (MoE) architecture. The idea is nearly 30 years old and has been used for large language models before, such as Google's Switch Transformer.
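
A quick calculation shows how the rumored configuration adds up to the headline figure:

```python
# Quick check of the rumored configuration: eight experts of 220 billion
# parameters each add up to the headline figure.
num_experts, params_per_expert = 8, 220e9
print(f"{num_experts * params_per_expert / 1e12:.2f} trillion parameters")  # 1.76 trillion
```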

The MoE model is a type of ensemble learning that combines different models, called "experts," to make a decision. In an MoE model, a gating network determines the weight of each expert's output based on the input. This allows different experts to specialize in different parts of the input space. This architecture is particularly useful for large and complex data sets, as it can effectively partition the problem space into simpler subspaces.
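
To illustrate the idea, here is a minimal top-k MoE layer sketched in NumPy. The dimensions, expert count, gating scheme, and ReLU experts are generic textbook choices for illustration, not GPT-4's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 64, 16, 2   # toy sizes, not GPT-4's

# Each "expert" is a tiny feed-forward block; the gate is a single linear layer.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
gate_w = rng.standard_normal((d_model, num_experts)) * 0.02

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x):
    """Route a token vector to the top-k experts and mix their outputs."""
    scores = softmax(x @ gate_w)                     # gating weights over all experts
    chosen = np.argsort(scores)[-top_k:]             # indices of the k highest-scoring experts
    weights = scores[chosen] / scores[chosen].sum()  # renormalize over the chosen experts
    return sum(w * np.maximum(x @ experts[i], 0.0)   # weighted ReLU expert outputs
               for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (64,) -- same shape as a dense MLP would return
```

The layer behaves like a drop-in replacement for a dense feed-forward block, but only the selected experts do any work for a given token.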

No statement from OpenAI, but the rumors are credible

The information about GPT-4 comes from George Hotz, founder of Comma.ai, an autonomous driving startup. Hotz is an AI expert who is also known for his hacking past: he was the first to crack the iPhone and Sony's PlayStation 3.

Other AI experts have also commented on Hotz's Twitter feed, saying that his information is very likely true.


What can open-source learn from GPT-4?

The architecture may have simplified GPT-4's training by allowing different teams to work on different parts of the network. It would also explain why OpenAI was able to develop GPT-4's multimodal capabilities independently of the currently available product and release them separately. In the meantime, however, GPT-4 may have been merged into a smaller, more efficient model, speculated Soumith Chintala, one of the founders of PyTorch.

Hotz also speculated that GPT-4 does not produce just a single output, but generates 16 outputs iteratively, improving the result with each iteration.
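
Taken at face value, such a loop might look roughly like the following sketch. The `draft_model` function is a hypothetical placeholder, and the whole thing is speculation, not a confirmed GPT-4 mechanism.

```python
# Purely illustrative sketch of the refinement loop Hotz describes; `draft_model`
# is a hypothetical stand-in for a text generator, not an OpenAI API, and nothing
# here reflects a confirmed GPT-4 mechanism.
def refine_iteratively(prompt, draft_model, iterations=16):
    """Generate an answer, then repeatedly ask the model to improve its own draft."""
    answer = draft_model(prompt)
    for _ in range(iterations - 1):
        answer = draft_model(f"{prompt}\n\nImprove this draft:\n{answer}")
    return answer

# Toy usage with a stand-in "model" that just tags each revision.
print(refine_iteratively("Explain MoE.", lambda p: p.splitlines()[-1] + " [revised]", iterations=3))
```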

The open-source community could now try to replicate this architecture; the ideas and technology have been available for some time. However, GPT-4 may have shown how far the MoE architecture can go with the right training data and computational resources.

Summary
  • OpenAI's GPT-4 is reportedly based on the "Mixture of Experts" architecture and includes 1.76 trillion parameters.
  • This architecture combines multiple models for decision making and can be particularly useful with large data sets.
  • The information comes from George Hotz, an AI expert, and has gained credibility in the AI community.
  • Open source developers could try to replicate this architecture and learn from GPT-4's advances.