Google is about to release Gemini, a multimodal language model that could challenge OpenAI's GPT-4. But OpenAI is reportedly preparing a fitting response.
According to The Information, OpenAI is preparing to roll out GPT-4's image understanding feature. This multimodal capability was introduced at GPT-4's unveiling and was a major aspect of the presentation. A demo in which GPT-4 wrote the code for a web page based solely on a drawing of it created some buzz.
However, aside from a few test users and select applications, image understanding is not yet available to regular GPT-4 users. According to a July report in the New York Times, OpenAI is concerned that the feature could be misused for things like facial recognition. OpenAI has blocked descriptions of people in the "Be My Eyes" visual aid app.
OpenAI's answers to Gemini: GPT-Vision and Gobi
With Google's planned release of its Gemini multimodal AI model this fall, OpenAI is reportedly looking to make image understanding more widely available under the name "GPT-Vision". This would open up a number of new image-based applications for GPT-4, such as generating text to match images.
There are also rumors that DALL-E 3, which is also under development, could be integrated into ChatGPT or GPT-4. Announcing the new image tool together with GPT-Vision seems plausible. The first images presumably generated by DALL-E 3 show great progress compared to DALL-E 2.
The new features for GPT-4 could be unveiled at OpenAI's one-day developer conference on November 6. OpenAI CEO Sam Altman announced that there will be "great stuff," though not as big as GPT-4.5 or GPT-5.
Specifically, the company is talking about "new tools" it wants to show. So GPT-Vision and DALL-E 3 would still be in the running, and the timing would be right to counter Google. DALL-E 2 needs an update anyway, as tools like Midjourney, Stable Diffusion, or Ideogram are way ahead.
According to The Information, OpenAI is also considering a new AI model, codenamed "Gobi," which is designed to be multimodal from the start. Training for Gobi has not yet begun, and there is a chance it could become GPT-5.