summary Summary

ChatGPT will be getting new features in the coming weeks to make interaction more intuitive. Some of these are previously announced speech and image recognition features.

OpenAI is extending ChatGPT's speech capabilities first in its apps for iOS and Android. You could already use voice chat instead of typing based on OpenAI's open-source Whisper model, but it was one-way. Now ChatGPT can also respond by voice. Although Whisper works in languages other than English, OpenAI discourages its use in other languages - especially those without Latin characters.

OpenAI has developed its own text-to-speech model

For voice output, OpenAI has developed its own text-to-speech model, which is also being adapted by Spotify. This makes OpenAI a solution provider in this area as well, competing with startups like Elevenlabs that focus on synthetic voices.

OpenAI's AI voice is capable of generating human-sounding synthetic voices in the style of the original voice from text with just a few seconds of sample audio. For the five ChatGPT voices, OpenAI worked with professional voice actors.


Watch the video: OpenAI

In addition to ChatGPT, the text-to-speech model is also used by Spotify, as mentioned above. The Swedish music streaming service uses it to translate podcasts into other languages in the voice of the podcast host. Spotify has released the first examples in Spanish. French and German will follow in the coming days and weeks.

  1. Lex Fridman Podcast - "Interview with Yuval Noah Harari".
  2. Armchair Expert - "Kristen Bell, by the grace of god, returns"
  3. The Diary of a CEO with Steven Bartlett - "Interview with Dr. Mindy Pelz"

Video: Spotify

OpenAI has not yet announced whether other companies or individuals will have access to the new text-to-speech model. However, the announcement notes that due to the risk of abuse by voice clones, a controlled rollout will initially take place for selected usage scenarios - such as voice chat and Spotify podcasts.

Voice can be combined with another ChatGPT innovation, the ability to recognize and talk about content in images. This feature was announced at the launch of GPT-4 and is now being rolled out.


As a practical example of multimodal prompts, OpenAI cites the ability to show ChatGPT a photo of a landmark and talk about it while traveling. Another example shows how ChatGPT can use pictures to help maintain a bicycle.

Video: OpenAI

The new image recognition, which works with GPT-3.5 and GPT-4, should also make ChatGPT useful for everyday questions. For example, according to the blog post, you can take pictures of your fridge and pantry and ChatGPT will suggest recipe ideas. It doesn't matter if you upload photos or graphics with text, ChatGPT can handle all image formats.

On the smartphone, you can mark an area in an image that ChatGPT should focus on. If and when this feature will come to the web interface is not known. It would make sense since ChatGPT-Enterprise in particular is likely to be used primarily by business users on large screens. According to OpenAI, the new features will be introduced gradually.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

OpenAI warns against using GPT-4V in sensitive scenarios

In the system card for GPT-4V(ision), OpenAI describes the model's image recognition as "unreliable". For example, when analyzing chemical structures, it misidentified substances such as fentanyl, carfentanil, and cocaine, but sometimes correctly identified dangerous components such as poisonous mushrooms.

The model was prone to hallucinations and could present incorrect facts in an authoritative tone. "This demonstrates that the model is unreliable and should not be used for any high-risk tasks such as identification of dangerous compounds or foods," the paper says.

Here, OpenAI specifically warns against using GPT-4V in a scientific and medical context. The company also provides examples of the model refusing to answer to avoid providing feedback on possible biases.

Image: OpenAI

OpenAI limits ChatGPT's image analysis for humans

Much of OpenAI's announcement of the new ChatGPT features revolves around the promise of developing safe and useful AI. The time since the unveiling of GPT-4 and its associated image recognition has been used for intensive testing, the company says. Nevertheless, hallucinations could not be ruled out.

"We’ve also taken technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people since ChatGPT is not always accurate and these systems should respect individuals’ privacy," OpenAI writes. However, real-world use helps improve those protections, it adds.

Previously, there were reports that OpenAI was concerned that ChatGPT's image understanding could be misused as a facial recognition system and should be restricted. The "Be My Eyes" app, which describes the environment for the visually impaired, disabled facial recognition months ago. In the system card for GPT-4V(ision), OpenAI writes that it is working on a feature that can describe faces without identifying people.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • OpenAI will be adding new speech and image recognition capabilities to ChatGPT in the coming weeks, allowing for more multimodal and interactive interaction.
  • The iOS and Android apps will feature voice output, with voice input already available. OpenAI has developed a proprietary model for creating human-sounding voices from text and voice clones based on short sound samples. The model is also used by Spotify to generate multilingual podcasts.
  • New image recognition for GPT-3.5 and GPT-4 allows voice and images to be combined in a single prompt. For example, ChatGPT users can show a photo of a landmark and talk about it while traveling, or get meal suggestions based on photos of the refrigerator and pantry.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.