Byte instead of token: A new paper from researchers at Microsoft Research Asia, the Central Conservatory of Music, China, and Tsinghua University introduces bGPT, a transformer model that relies on byte prediction instead of conventional token prediction.
Similar attempts have been made before, but earlier byte-level models were usually limited to specific formats and tasks; bGPT works directly with native binary data of any kind. As a result, the model can handle a wide range of data types and perform tasks such as generative modeling and classification of digital media, including text, audio, and images.
The title of the paper expresses the goal: "Beyond Language Models: Byte Models are Digital World Simulators."
By training on byte sequences, the model is designed to learn the patterns of digital systems and thus reconstruct complex systems from binary data. It also integrates different types of data into a single framework by treating everything as a byte sequence.
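To make the principle concrete, here is a minimal, hypothetical PyTorch sketch of next-byte prediction: any file is read as raw bytes, giving a fixed vocabulary of 256 values, and a small causal transformer is trained to predict each following byte. The model, hyperparameters, and file path are illustrative assumptions, not the paper's actual bGPT architecture, which additionally groups bytes into patches for efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyByteLM(nn.Module):
    """Minimal causal transformer over raw bytes (vocabulary of 256 values).

    Illustrative only; the real bGPT groups bytes into patches before
    feeding them to the transformer.
    """
    def __init__(self, d_model=256, n_head=4, n_layer=4, max_len=1024):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)   # one embedding per byte value
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, 256)            # logits over the next byte

    def forward(self, byte_ids):                       # (batch, seq), values in [0, 255]
        seq_len = byte_ids.shape[1]
        pos = torch.arange(seq_len, device=byte_ids.device)
        x = self.byte_embed(byte_ids) + self.pos_embed(pos)
        # Causal mask so each position only attends to earlier bytes
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(byte_ids.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)

# Any file (text, MIDI, an image) yields the same kind of training example.
# "example.mid" is a placeholder path.
raw = open("example.mid", "rb").read()[:1024]
ids = torch.tensor(list(raw), dtype=torch.long).unsqueeze(0)
logits = TinyByteLM()(ids[:, :-1])                     # predict bytes 1..n from bytes 0..n-1
loss = F.cross_entropy(logits.reshape(-1, 256), ids[:, 1:].reshape(-1))
```

Because the input is just bytes, the same training loop applies unchanged whether the file holds text, audio, or an image; the format-specific structure is left entirely for the model to learn.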
bGPT models CPU states, generates text, images, and audio
Byte-level processing allows bGPT to be used for more unusual generative AI applications in addition to the usual ones. The model simulated the conversion of symbolic music data almost error-free, achieving a low error rate of 0.0011 bits per byte when converting ABC notation to MIDI format. When simulating the behavior of simple CPUs by predicting CPU states, bGPT achieved over 99.99% accuracy on operations such as data shifting and logical and arithmetic operations. According to the team, this could be useful for interpreting operational data and emulating digital activity in hardware.
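For context, "bits per byte" measures how much uncertainty remains in the model's next-byte predictions: it is the average cross-entropy of the predicted byte distribution, expressed in bits. A short sketch of the conversion from the usual nats-based training loss (the 0.0011 figure is the paper's reported value; the loss value here is just the back-calculated illustration):

```python
import math

def bits_per_byte(avg_cross_entropy_nats: float) -> float:
    """Convert average next-byte cross-entropy from nats to bits per byte."""
    return avg_cross_entropy_nats / math.log(2)

# A loss of about 0.00076 nats per byte corresponds to the
# reported ~0.0011 bits per byte.
print(bits_per_byte(0.00076))  # ~0.0011
```

Intuitively, a value this close to zero means the model's next-byte predictions are almost deterministic, which is why the ABC-to-MIDI conversion is described as nearly error-free.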
But bGPT also showed promising results for tasks such as text, image, and audio generation. For text, the 110-million-parameter model is roughly on par with GPT-2, with some advantages. However, the model has limitations: it struggles with non-English terms in text generation, and its generated images show noticeable artifacts and noise because images are processed sequentially as flat byte streams. Nevertheless, the researchers believe that simply scaling up the model could lead to state-of-the-art results.
By focusing on byte models, the researchers hope to reduce computational costs and make model and dataset sizes more scalable, since byte models can process a much broader range of native binary data.
The model, code, and examples can be found on the bGPT project page.