
Two new papers examine the visual capabilities of Google's Gemini Pro and GPT-4V. Both models are roughly on par, with slight advantages for GPT-4V.

Two new papers from Tencent Youtu Lab, the University of Hong Kong, and numerous other universities and institutes comprehensively compare the visual capabilities of Google's Gemini Pro and OpenAI's GPT-4V, currently the most capable multimodal language models (MLLMs).

The research focuses on the specific strengths and capabilities of each model and gives a detailed comparison across multiple dimensions. These include image recognition, text recognition in images, image reasoning, reasoning about text in images, integrated image and text understanding, object localization, temporal video understanding, and multilingual capability.

GPT-4V and Gemini Pro are on par when it comes to visual comprehension and reasoning

Both models showed comparable performance on basic image recognition tasks. They can extract text from images, but still need improvement in areas such as recognizing complex formulas, as one of the two papers shows.

Image: Qi et al.

In image understanding, both models showed good common-sense reasoning. However, Gemini performed slightly worse than GPT-4V on pattern-recognition tests (IQ tests).

Image: Fu et al.

Both models also showed a good understanding of humor, emotion, and aesthetic judgment (EQ tests).

Image: Qi et al.

In terms of text comprehension, Gemini performed somewhat worse than GPT-4V on complex tabular reasoning and mathematical problem-solving tasks. Google's larger model, Gemini Ultra, could bring greater improvements here.

MME benchmark results. | Image: Fu et al.

On the level of detail and accuracy of the responses, the research teams made exactly opposite observations: one group attributed the more detailed responses to Gemini, the other to GPT-4V. Gemini also adds relevant images and links to its responses.

In terms of commercial applications, Gemini was outperformed by GPT-4V in the areas of embodied agents and GUI navigation. Gemini, in turn, is said to have advantages in multimodal reasoning capability.


Both research teams conclude that Gemini and GPT-4V are capable and impressive multimodal AI models. In terms of overall performance, GPT-4V is rated as slightly more capable than Gemini Pro. Gemini Ultra and GPT-4.5 could bring further improvements.

However, both Gemini and GPT-4V still have weaknesses in spatial visual understanding, handwriting recognition, logical reasoning, and prompt robustness. The road to multimodal general AI is still a long one, one paper concludes.

You can find many more comparisons and examples of the image analysis capabilities of GPT-4V and Gemini Pro in the scientific papers linked below.

Summary
  • Two new research papers examine the visual capabilities of Google's Gemini Pro and OpenAI's GPT-4V, currently the most capable multimodal language models. They show that both models perform comparably.
  • The models were tested in areas such as image recognition, text recognition in images, image and text understanding, object localization, and multilingual capabilities, with GPT-4V rated slightly more powerful overall.
  • However, both models have room for improvement in visual comprehension, logical reasoning, and robustness of prompts. The road to multimodal general-purpose AI is still a long one, one paper concludes.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.