The Allen Institute for AI has unveiled Unified-IO 2, a new advanced AI model. It may offer a glimpse of the next generation of models such as GPT-5.

According to the researchers, it is the first model that can both process and generate text, images, audio, video, and action sequences. The 7-billion-parameter model was trained from scratch on a broad range of multimodal data and can be controlled via prompts.
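The unifying idea behind such models is to cast inputs and outputs of all modalities into one shared sequence of tokens that a single model can read and emit interchangeably. Here is a minimal sketch of that interface; all names and values (Token, encode_image, vocabulary sizes) are illustrative stand-ins, not the model's actual API or tokenizers.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    modality: str  # "text", "image", "audio", "action"
    id: int        # index into that modality's vocabulary or codebook

def encode_text(prompt: str) -> List[Token]:
    # Stand-in for a subword tokenizer such as SentencePiece.
    return [Token("text", hash(w) % 32000) for w in prompt.split()]

def encode_image(pixels: bytes) -> List[Token]:
    # Stand-in for a VQ-style image tokenizer that maps patches
    # to discrete codebook indices.
    return [Token("image", b % 8192) for b in pixels[:16]]

def build_input(prompt: str, image: bytes) -> List[Token]:
    # The unified interface: one flat token sequence, where the text
    # prompt tells the model which task to perform on the image.
    return encode_text(prompt) + encode_image(image)

seq = build_input("Describe this image:", b"\x01\x02\x03\x04" * 4)
print(len(seq), "tokens across modalities:", {t.modality for t in seq})
```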

Multimodal Unified-IO 2 was trained with billions of data points

Unified-IO 2 was trained on 1 billion image-text pairs, 1 trillion text tokens, 180 million video clips, 130 million images with text, 3 million 3D assets, and 1 million robot agent motion sequences. In total, the team combined more than 120 datasets into a 600-terabyte package covering 220 visual, linguistic, auditory, and action tasks.

Image: Lu, Clark, Lee, Zhang et al.

The encoder-decoder model introduces several architectural changes to stabilize training and make effective use of multimodal signals, paving the way for larger and more capable multimodal models. According to the paper, these changes include 2D rotary position embeddings, QK normalization, and scaled cosine attention.
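One of these stabilizers, QK normalization, is straightforward to illustrate: queries and keys are L2-normalized before their dot product, so attention logits stay bounded even when activations grow large. The following is a generic NumPy sketch of the technique, not code from the Unified-IO 2 release.

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=10.0):
    # L2-normalize queries and keys per position so the dot products
    # (and therefore the softmax logits) are bounded by `scale`,
    # which helps keep training from diverging.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    logits = scale * (q @ k.T)  # each entry lies in [-scale, scale]
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)  # (4, 8)
```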

Unified-IO 2's range of capabilities is unique to date

The training enables Unified-IO 2 to process, understand, and produce text. For example, the model can answer questions, write text based on instructions, and analyze text content. It can also recognize image content, provide image descriptions, perform image processing tasks, and create new images from text descriptions.

Unified-IO 2 processes more modalities than any previously available model. | Image: Lu, Clark, Lee, Zhang et al.

It can also generate music or sounds from descriptions or instructions, analyze videos, and answer questions about them. Because it was trained with robot data, Unified-IO 2 can also generate actions for robotic systems, for example converting instructions into action sequences for robots. Thanks to its multimodal training, it can also combine modalities, for example marking the instruments of an audio track in an image.
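A common way to let a sequence model output robot actions is to discretize continuous control values into a small vocabulary of bins that the decoder can emit like words. The sketch below shows this binning idea; the bin count and value range are illustrative assumptions, not settings from the paper.

```python
import numpy as np

N_BINS = 256  # assumed vocabulary size for action tokens

def action_to_tokens(action, low=-1.0, high=1.0):
    # Uniformly bin each action dimension (e.g. end-effector deltas)
    # into one of N_BINS discrete token ids.
    clipped = np.clip(np.asarray(action), low, high)
    return ((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens, low=-1.0, high=1.0):
    # Inverse mapping: token ids back to approximate continuous values.
    return low + tokens / (N_BINS - 1) * (high - low)

a = [0.25, -0.5, 0.9]                # a 3-DoF action
toks = action_to_tokens(a)
print(toks, tokens_to_action(toks))  # round-trips within bin width
```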

The model performs well on more than 35 benchmarks covering image generation and understanding, natural language understanding, video and audio understanding, and robot manipulation. It matches or outperforms specialized models on most tasks. It also sets a new state of the art on the GRIT benchmark, which evaluates models across a range of image tasks, including how they deal with image noise and other distortions.

Unified-IO showed what was possible before GPT-4. Will Unified-IO 2 do the same before GPT-5?

Its predecessor, Unified-IO, was introduced in June 2022 and was one of the first multimodal models capable of processing both images and language. Around the same time, OpenAI was testing GPT-4 internally, before publicly introducing the large language model, including its image understanding, in March 2023.

Unified-IO was thus an early glimpse into the future of large-scale AI models, which have since become commonplace with OpenAI's models and Google's multimodally trained Gemini. Unified-IO 2 now shows what we can expect in 2024: AI models that process even more modalities, perform many tasks natively thanks to extensive training, and have a rudimentary understanding of interactions with objects and robots. The latter could also improve their performance in other areas.

The team now plans to scale Unified-IO 2 further, improve data quality, and convert the encoder-decoder model into the decoder-only architecture that has become the industry standard.

More information and the code can be found on the Unified-IO 2 project page.

Summary
  • The Allen Institute for AI has introduced Unified-IO 2, an advanced AI model that can process and produce text, image, audio, video, and action sequences by being trained on a wide range of multimodal data.
  • Unified-IO 2 can compose text, answer questions, recognize image content, generate music, analyze video, and even generate action sequences for robots, performing well on over 35 benchmarks.
  • The team plans to further scale Unified-IO 2 and improve data quality to optimize the model's performance in different application domains.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.