xAI introduces Grok-1.5 Vision multimodal AI model and a physical world benchmark

xAI releases preview of Grok-1.5 Vision multimodal AI model with improved understanding of the physical world.

Elon Musk's AI startup xAI has released a preview of Grok-1.5 Vision, its first multimodal AI model, which the company says has a better understanding of the physical world than its competitors.

In addition to standard text capabilities, Grok-1.5V can process a range of visual inputs, including documents, diagrams, graphics, screenshots, and photos. The model will soon be available to early testers and current Grok users.

xAI claims that Grok-1.5V is competitive with today's best multimodal models in several areas, from multidisciplinary reasoning to understanding documents, scientific diagrams, graphics, screenshots, and photos.

The company has published a table comparing Grok's performance on various benchmarks against competing models such as OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini Pro. According to the table, Grok achieved equal or better results in most areas.

xAI highlights Grok's ability to understand the physical world as measured by the newly introduced RealWorldQA benchmark, which tests spatial understanding of the real world. In zero-shot tests (no example solutions given in the prompt), Grok outperformed its competitors on this benchmark, according to xAI.

Grok-1.5V's vision capabilities compared to the competition. | Image: xAI

An example shows how Grok can generate working Python code from a flowchart describing the logic of a guessing game. This demonstrates the model's ability to understand diagrams and translate them into executable code. Another example shows Grok-1.5V explaining a meme.
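
To give a sense of the kind of program such a flowchart describes, here is a minimal Python sketch of guessing-game logic. It is not xAI's output or the code Grok produced. The function names, the feedback strings, and the 1 to 10 number range are all assumptions made for illustration.

```python
import random
from typing import Iterable, List


def check_guess(guess: int, target: int) -> str:
    """Compare one guess with the target and return the feedback text."""
    if guess < target:
        return "too low"
    if guess > target:
        return "too high"
    return "correct"


def play(guesses: Iterable[int], target: int) -> List[str]:
    """Process guesses in order and stop at the first correct one."""
    feedback = []
    for guess in guesses:
        result = check_guess(guess, target)
        feedback.append(result)
        if result == "correct":
            break
    return feedback


# Assumed setup: the 1-10 range is hypothetical, not taken from xAI's flowchart.
secret_number = random.randint(1, 10)
```

In an interactive version, the guesses would come from user input rather than a prepared list. Passing them in as an iterable simply makes the loop easy to check.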

RealWorldQA: Visual benchmark for the physical world

xAI argues that understanding the physical world is essential to developing useful AI assistants for the real world. To this end, the company has developed the RealWorldQA benchmark, which evaluates the spatial understanding of multimodal models. Many of the examples are easy for humans but challenging for AI models.

The initial RealWorldQA dataset consists of more than 700 images, each paired with a question and an easily verifiable answer. The images were taken from vehicles and other sources and have been anonymized. xAI is making the dataset available to the community for download under the CC BY-ND 4.0 license.
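
Because each image comes with a question and a verifiable answer, a benchmark of this kind can in principle be scored by simple comparison. The following Python sketch shows one way a zero-shot accuracy score might be computed. The record layout (`image_path`, `question`, `answer`), the `model` callable, and the exact-match scoring rule are assumptions for illustration. The article does not describe xAI's actual evaluation procedure or the dataset's file format.

```python
from typing import Callable, List, NamedTuple


class Item(NamedTuple):
    """One benchmark entry (field names are hypothetical)."""
    image_path: str
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.strip().lower().rstrip(".").split())


def zero_shot_accuracy(
    items: List[Item],
    model: Callable[[str, str], str],
) -> float:
    """Return the share of items answered correctly.

    Zero-shot here means the model sees only the image and the question,
    with no solved examples added to the prompt.
    """
    if not items:
        return 0.0
    correct = 0
    for item in items:
        prediction = model(item.image_path, item.question)
        # Assumed scoring rule: exact match after light normalization.
        if normalize(prediction) == normalize(item.answer):
            correct += 1
    return correct / len(items)
```

Any function that takes an image path and a question and returns a string can stand in for `model`, which keeps the scoring logic independent of a particular API.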

Sample images from the RealWorldQA dataset. | Image: xAI

xAI sees advances in multimodal understanding and generation as important steps toward a useful artificial general intelligence (AGI) that can understand the universe, which is xAI's self-proclaimed mission.

The company expects significant improvements in both areas in the coming months across modalities such as images, audio, and video. xAI reportedly plans to launch Grok-2 in May, which Musk says will outperform GPT-4.
