The Allen Institute for AI has unveiled Unified-IO 2, a new advanced AI model. It may offer a glimpse of the next generation of models such as GPT-5.

According to the researchers, it is the first model that can both process and generate text, images, audio, video, and action sequences. The 7-billion-parameter model was trained from scratch on a broad range of multimodal data and can be controlled via prompts.
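The unifying idea behind such models is to cast inputs and outputs of all modalities into one shared sequence of tokens that a single model can read and emit interchangeably. Here is a minimal sketch of that interface; all names and values (Token, encode_image, vocabulary sizes) are illustrative stand-ins, not the model's actual API or tokenizers.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    modality: str  # "text", "image", "audio", "action"
    id: int        # index into that modality's vocabulary or codebook

def encode_text(prompt: str) -> List[Token]:
    # Stand-in for a subword tokenizer such as SentencePiece.
    return [Token("text", hash(w) % 32000) for w in prompt.split()]

def encode_image(pixels: bytes) -> List[Token]:
    # Stand-in for a VQ-style image tokenizer that maps patches
    # to discrete codebook indices.
    return [Token("image", b % 8192) for b in pixels[:16]]

def build_input(prompt: str, image: bytes) -> List[Token]:
    # The unified interface: one flat token sequence, where the text
    # prompt tells the model which task to perform on the image.
    return encode_text(prompt) + encode_image(image)

seq = build_input("Describe this image:", b"\x01\x02\x03\x04" * 4)
print(len(seq), "tokens across modalities:", {t.modality for t in seq})
```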

Multimodal Unified-IO 2 was trained with billions of data points

Unified-IO 2 was trained on 1 billion image-text pairs, 1 trillion text tokens, 180 million video clips, 130 million images with text, 3 million 3D assets, and 1 million robot agent motion sequences. In total, the team combined more than 120 datasets into a 600-terabyte package covering 220 visual, linguistic, auditory, and action tasks.

Image: Lu, Clark, Lee, Zhang et al.

The encoder-decoder model introduces several architectural changes to stabilize training and make effective use of multimodal signals, paving the way for larger and more capable multimodal models. According to the paper, these changes include 2D rotary position embeddings, QK normalization, and scaled cosine attention.
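One of these stabilizers, QK normalization, is straightforward to illustrate: queries and keys are L2-normalized before their dot product, so attention logits stay bounded even when activations grow large. The following is a generic NumPy sketch of the technique, not code from the Unified-IO 2 release.

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=10.0):
    # L2-normalize queries and keys per position so the dot products
    # (and therefore the softmax logits) are bounded by `scale`,
    # which helps keep training from diverging.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    logits = scale * (q @ k.T)  # each entry lies in [-scale, scale]
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)  # (4, 8)
```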

Unified-IO 2's range of capabilities is unique to date

The training enables Unified-IO 2 to process, understand, and produce text. For example, the model can answer questions, write text based on instructions, and analyze text content. It can also recognize image content, provide image descriptions, perform image processing tasks, and create new images from text descriptions.

Unified-IO 2 processes more modalities than any previously available model. | Image: Lu, Clark, Lee, Zhang et al.

It can also generate music or sounds from descriptions or instructions, analyze videos, and answer questions about them. Because it was trained with robot data, Unified-IO 2 can also generate actions for robotic systems, for example converting instructions into action sequences for robots. Thanks to its multimodal training, it can also combine modalities, for example marking the instruments of an audio track in an image.
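A common way to let a sequence model output robot actions is to discretize continuous control values into a small vocabulary of bins that the decoder can emit like words. The sketch below shows this binning idea; the bin count and value range are illustrative assumptions, not settings from the paper.

```python
import numpy as np

N_BINS = 256  # assumed vocabulary size for action tokens

def action_to_tokens(action, low=-1.0, high=1.0):
    # Uniformly bin each action dimension (e.g. end-effector deltas)
    # into one of N_BINS discrete token ids.
    clipped = np.clip(np.asarray(action), low, high)
    return ((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens, low=-1.0, high=1.0):
    # Inverse mapping: token ids back to approximate continuous values.
    return low + tokens / (N_BINS - 1) * (high - low)

a = [0.25, -0.5, 0.9]                # a 3-DoF action
toks = action_to_tokens(a)
print(toks, tokens_to_action(toks))  # round-trips within bin width
```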

The model performs well on more than 35 benchmarks covering image generation and understanding, natural language understanding, video and audio understanding, and robot manipulation. It matches or outperforms specialized models on most tasks. It also sets a new state of the art on the GRIT benchmark, which evaluates models across a range of image tasks, including how they deal with image noise and other distortions.

Unified-IO showed what was possible before GPT-4. Will Unified-IO 2 do the same before GPT-5?

Its predecessor, Unified-IO, was introduced in June 2022 and was one of the first multimodal models capable of processing both images and language. Around the same time, OpenAI was testing GPT-4 internally, before publicly introducing the large language model, including its image understanding, in March 2023.

Unified-IO was thus an early glimpse into the future of large-scale AI models, which have since become commonplace with OpenAI's models and Google's multimodally trained Gemini. Unified-IO 2 now shows what we can expect in 2024: AI models that process even more modalities, perform many tasks natively thanks to extensive training, and have a rudimentary understanding of interactions with objects and robots. The latter could also improve their performance in other areas.

The team now plans to scale Unified-IO 2 further, improve data quality, and convert the encoder-decoder model into the decoder-only architecture that has become the industry standard.

More information and the code can be found on the Unified-IO 2 project page.

Summary
  • The Allen Institute for AI has introduced Unified-IO 2, an advanced AI model that can process and produce text, image, audio, video, and action sequences by being trained on a wide range of multimodal data.
  • Unified-IO 2 can compose text, answer questions, recognize image content, generate music, analyze video, and even generate action sequences for robots, performing well on over 35 benchmarks.
  • The team plans to further scale Unified-IO 2 and improve data quality to optimize the model's performance in different application domains.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.