
xAI releases preview of Grok-1.5 Vision multimodal AI model with improved understanding of the physical world.


Elon Musk's AI startup xAI has released a preview of Grok-1.5 Vision, its first multimodal AI model, which the company says has a better understanding of the physical world than its competitors.

In addition to standard text capabilities, Grok-1.5V can process various visual information, including documents, diagrams, graphics, screenshots, and photos. The model will soon be available to early testers and current Grok users.

xAI claims that Grok-1.5V is competitive with today's best multimodal models in several areas, from multidisciplinary reasoning to understanding documents, scientific diagrams, graphics, screenshots, and photos.


The company has published a table comparing Grok's performance on various benchmarks against competing models such as OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini Pro. Grok achieved equal or better results in most areas.

xAI highlights Grok's ability to understand the physical world as measured by the newly introduced RealWorldQA benchmark, which tests spatial understanding of the real world. Grok outperformed competitors using zero-shot prompts (no example solutions given in the prompt).

Grok-1.5V's vision capabilities compared to the competition. | Image: xAI

An example shows how Grok can generate working Python code from a flowchart describing the logic of a guessing game. This demonstrates the model's ability to understand diagrams and translate them into executable code. Another example shows Grok-1.5V explaining a meme.
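The article does not reproduce Grok's generated code, but the guessing-game logic such a flowchart typically describes can be sketched as follows (all names and the exact game rules here are hypothetical, not xAI's actual example):

```python
import random

def guessing_game(secret=None, guesses=None):
    """Play one round of a number-guessing game.

    `secret` and `guesses` are hypothetical parameters that let the round
    be scripted instead of interactive; with no arguments, the program
    picks a random number and reads guesses from the user.
    """
    if secret is None:
        secret = random.randint(1, 100)
    scripted = iter(guesses) if guesses is not None else None
    attempts = 0
    while True:
        guess = next(scripted) if scripted is not None else int(input("Guess: "))
        attempts += 1
        if guess < secret:
            print("Higher!")
        elif guess > secret:
            print("Lower!")
        else:
            print(f"Correct in {attempts} attempts.")
            return attempts
```

A flowchart for a game like this maps almost one-to-one onto the loop above: each decision diamond becomes an `if`/`elif` branch, and the loop-back arrow becomes the `while True`.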

Image: xAI

RealWorldQA: Visual benchmark for the physical world

xAI argues that understanding the physical world is essential to developing useful AI assistants for the real world. To this end, the company has developed the RealWorldQA benchmark, which evaluates the spatial capabilities of multimodal models. Many of the examples are straightforward for humans, but are often challenging for AI models.

The initial RealWorldQA dataset consists of more than 700 images, each containing a question and an easily verifiable answer. The images are taken from vehicles and other sources, and are anonymized. xAI is making the dataset available to the community for download under the CC BY-ND 4.0 license.
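Because each image comes with an easily verifiable answer, scoring a model on a benchmark like this reduces to exact-match accuracy. A minimal sketch, assuming a hypothetical record layout (the actual RealWorldQA download format may differ):

```python
def score(predictions, dataset):
    """Exact-match accuracy of model answers over a QA dataset.

    `dataset` is a list of dicts with hypothetical keys "image" and
    "answer"; `predictions` maps image identifiers to the model's
    zero-shot answers. Matching is case- and whitespace-insensitive.
    """
    correct = sum(
        1
        for item in dataset
        if predictions.get(item["image"], "").strip().lower()
        == item["answer"].strip().lower()
    )
    return correct / len(dataset)
```

Zero-shot evaluation means the model sees only the image and question, with no worked examples in the prompt, so a harness like this needs nothing beyond the raw QA pairs.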

Sample images from the RealWorldQA dataset. | Image: xAI

xAI sees the advancement of multimodal understanding and generation capabilities as important steps toward a useful Artificial General Intelligence (AGI) that can understand the universe - xAI's self-proclaimed mission.

The company expects significant improvements in both areas in the coming months for various modalities such as images, audio, and video. In May, xAI reportedly plans to launch Grok-2, which Musk says will outperform GPT-4.

Summary
  • With Grok-1.5V, xAI releases its first multimodal AI model that can process visual information such as documents, diagrams, graphics, screenshots, and photos in addition to text.
  • In various benchmarks such as MMLU, MathVista, AI2D, TextVQA, ChartQA, and DocVQA, Grok-1.5V is competitive with other leading multimodal models. It performs particularly well on the new RealWorldQA benchmark, which measures spatial understanding of the real world.
  • The RealWorldQA dataset consists of over 700 images with questions and answers and is designed to evaluate the ability of multimodal models to understand the physical world. It is available for download from xAI.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.