Summary

Meta introduces MegaByte, a method that could take the performance and efficiency of transformer models to a new level.


Nearly all current Transformer models rely on tokenizers. These algorithms convert words, images, audio, or other input into tokens, which models such as GPT-4 can then process as a series of numbers. For language models, short common words typically map to a single token, while longer words are split into multiple tokens.
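As a toy illustration, here is a minimal greedy subword tokenizer with a made-up five-entry vocabulary. This is not any real model's tokenizer; the vocabulary and token IDs are invented for the example:

```python
# Toy subword vocabulary: short common words get one token,
# rarer words are split into several pieces. Purely illustrative.
VOCAB = {"the": 0, "cat": 1, "may": 2, "onna": 3, "ise": 4}

def toy_tokenize(word):
    """Greedily split a word into the longest known vocabulary pieces."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest match first
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

print(toy_tokenize("cat"))         # short word -> one token: [1]
print(toy_tokenize("mayonnaise"))  # longer word -> several tokens: [2, 3, 4]
```

Note that once "mayonnaise" becomes the three opaque IDs `[2, 3, 4]`, the individual letters are no longer visible to the model.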

Tiktokenizer visualizes how a tokenizer works.

However, tokens have drawbacks: depending on the model architecture, processing them is computationally intensive, integrating new modalities is difficult, and tokens usually do not operate at the level of individual letters. The last point repeatedly leads to subtle capability gaps in language models, such as the inability to count the number of "n"s in the word "mayonnaise".
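Byte-level processing sidesteps this particular gap, because every letter is visible as its own byte. A quick Python illustration:

```python
# At the byte level, "mayonnaise" is just a sequence of 10 bytes,
# one per letter, so counting the letter "n" is trivial.
raw = "mayonnaise".encode("utf-8")

print(list(raw))        # one integer (0-255) per byte/letter
print(raw.count(b"n"))  # -> 2
```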

These and other factors also make it difficult to handle large inputs such as entire books, videos, or podcasts, although models such as GPT-4 and Claude can now handle between 32,000 and 100,000 tokens.


Meta's MegaByte operates at the byte level

With MegaByte, the researchers at Meta AI now demonstrate a method that dispenses with classical tokenizers and instead processes text, images, and audio at the byte level. Similar to a tokenizer, MegaByte first breaks sequences of text or other modalities into individual patches.

A patch embedder then encodes each patch by losslessly concatenating the embeddings of its individual bytes, such as letters. A global module, a large autoregressive Transformer, takes these patch representations as input and outputs contextualized patch representations.

Each patch is then processed by a local autoregressive Transformer model that predicts the bytes within the patch.

The MegaByte architecture. | Image: Meta
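Put together, the pipeline can be sketched in a few lines. This is only a structural sketch under simplifying assumptions, not Meta's implementation: the global and local transformers are replaced by trivial stand-in functions, and all sizes are made up.

```python
import random

random.seed(0)
PATCH_SIZE = 4                 # bytes per patch (illustrative)
D_BYTE = 8                     # per-byte embedding size
D_PATCH = PATCH_SIZE * D_BYTE  # patch vector = concatenated byte embeddings

# One random embedding vector per possible byte value (0-255).
byte_embed = {b: [random.gauss(0, 1) for _ in range(D_BYTE)] for b in range(256)}

def embed_patches(data):
    """Split a byte sequence into fixed-size patches and losslessly
    concatenate the byte embeddings within each patch."""
    assert len(data) % PATCH_SIZE == 0
    patches = [data[i:i + PATCH_SIZE] for i in range(0, len(data), PATCH_SIZE)]
    return [[v for b in p for v in byte_embed[b]] for p in patches]

def global_module(patch_vectors):
    # Stand-in for the large autoregressive Transformer over patches:
    # here it simply passes the patch representations through.
    return patch_vectors

def local_module(ctx_vector):
    # Stand-in for the small Transformer that predicts the bytes of one
    # patch: returns PATCH_SIZE dummy distributions over 256 byte values.
    return [[0.0] * 256 for _ in range(PATCH_SIZE)]

def megabyte_sketch(data):
    x = embed_patches(data)               # (num_patches, D_PATCH)
    ctx = global_module(x)                # contextualized patch vectors
    return [local_module(c) for c in ctx]  # per-patch, per-byte predictions

out = megabyte_sketch(b"mayonnaise!!")          # 12 bytes -> 3 patches
print(len(out), len(out[0]), len(out[0][0]))    # 3 4 256
```

The key structural point the sketch preserves: the expensive global model only sees one vector per patch, while the cheap local model handles byte-by-byte prediction inside each patch.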

According to Meta, the architecture enables a higher degree of computational parallelism, larger and more powerful models for the same computational cost, and a significant reduction in the cost of the transformers' self-attention mechanism.
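The attention saving can be made concrete with a back-of-the-envelope calculation: full self-attention over T bytes scales with T², while MegaByte pays for global attention over T/P patches plus local attention within each patch of size P. The value P = 8 below is illustrative, and constant factors and feed-forward costs are ignored:

```python
T = 1_000_000  # sequence length in bytes
P = 8          # patch size (illustrative)

vanilla = T ** 2                # one self-attention over all T bytes
global_cost = (T // P) ** 2     # attention over T/P patch representations
local_cost = (T // P) * P ** 2  # attention within each of the T/P patches
megabyte = global_cost + local_cost

print(f"vanilla:  {vanilla:.1e}")
print(f"megabyte: {megabyte:.1e}")
print(f"speedup:  ~{vanilla / megabyte:.0f}x")  # roughly P**2
```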

The team compares MegaByte with other models, such as a standard decoder-only Transformer and DeepMind's PerceiverAR, on text, image, and audio benchmarks, showing that MegaByte is more efficient and can handle sequences of nearly a million bytes.


OpenAI's Andrej Karpathy calls Meta's MegaByte "promising"

OpenAI's Andrej Karpathy called Meta's MegaByte promising work. "Everyone should hope we can throw away tokenization in LLMs," Karpathy wrote on Twitter.

The Meta AI team also sees its results as an indication that MegaByte may have the potential to replace classic tokenizers in Transformer models.

MEGABYTE outperforms existing byte-level models across a range of tasks and modalities, allowing large models of sequences of over 1 million tokens. It also gives competitive language modeling results with subword models, which may allow byte-level models to replace tokenization.


Since the models on which the experiments were performed are well below the size of current language models, Meta plans to scale up to much larger models and datasets as a next step.

  • Meta's MegaByte improves Transformer models by eliminating tokenizers and instead processing input at the byte level.
  • The new architecture enables greater computational parallelism and reduces the cost of the self-attention mechanism.
  • MegaByte has performed well in initial testing and will be scaled to larger models and datasets.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.