Google is about to release Gemini, a multimodal language model that could challenge OpenAI's GPT-4. But OpenAI is reportedly preparing a fitting response.

According to The Information, OpenAI is preparing to roll out GPT-4's image understanding feature. This multimodal capability was a major aspect of GPT-4's unveiling: a demo in which GPT-4 produced working web page code from nothing more than a hand-drawn sketch of the page drew considerable buzz.

However, aside from a few test users and select applications, this image understanding is not yet available to regular GPT-4 users. According to a July report in the New York Times, OpenAI is concerned that the feature could be misused for purposes such as facial recognition. In the visual aid app "Be My Eyes," OpenAI has blocked descriptions of people.

OpenAI's answers to Gemini: GPT-Vision and Gobi

With Google's planned release of its Gemini multimodal AI model this fall, OpenAI is reportedly looking to make image understanding more widely available under the name "GPT-Vision". This would open up a number of new image-based applications for GPT-4, such as generating text to match images.
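For illustration only, here is a hypothetical sketch of what a multimodal "GPT-Vision" request could look like, modeled on OpenAI's existing chat-completions message format. The model name `gpt-4-vision` and the mixed text/image content structure are assumptions, not a confirmed API.

```python
# Hypothetical sketch of a multimodal request payload combining text and an
# image reference. The model name and content schema are assumed, not an
# announced OpenAI API.
import json


def build_vision_request(prompt: str, image_url: str) -> dict:
    """Assemble a chat-style request that mixes a text prompt with an image."""
    return {
        "model": "gpt-4-vision",  # assumed model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


payload = build_vision_request(
    "Write a caption for this image.",
    "https://example.com/sketch.png",
)
print(json.dumps(payload, indent=2))
```

A request like this would cover use cases such as the text-to-match-images scenario mentioned above, with the image passed by URL alongside the instruction.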


There are also rumors that DALL-E 3, which is still under development, could be integrated into ChatGPT or GPT-4. Announcing the new image generator alongside GPT-Vision seems plausible. The first images presumably generated by DALL-E 3 show great progress over DALL-E 2.

The new features for GPT-4 could be unveiled at OpenAI's one-day developer conference on November 6. OpenAI CEO Sam Altman announced that there will be "great stuff," though not as big as GPT-4.5 or GPT-5.

Specifically, the company says it wants to show "new tools." GPT-Vision and DALL-E 3 would fit that description, and the timing would be right to counter Google. DALL-E 2 is due for an update anyway, as tools like Midjourney, Stable Diffusion, and Ideogram are well ahead of it.

According to The Information, OpenAI is also considering a new AI model, codenamed "Gobi," which is designed to be multimodal from the start. Training for Gobi has not yet begun, and there is a chance it could become GPT-5.

Summary
  • OpenAI is getting ready to roll out GPT-4's multimodal image understanding more widely under the name "GPT-Vision," reportedly to compete with Google's planned Gemini multimodal AI model.
  • This feature could enable new image-based applications for GPT-4, such as generating text that matches images.
  • OpenAI is also working on a new AI model, codenamed "Gobi," which is designed to be multimodal from the start and could potentially become GPT-5.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.