Date: 2024-07-04
AI research
Maximilian Schreiner

Whiteboard of Thought: New method allows GPT-4o to reason with images

Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Summary

"Whiteboard-of-Thought" enables multimodal language models to use images as intermediate steps in thinking, improving performance on tasks that require visual and spatial reasoning.


Researchers from Columbia University have developed a new technique that allows multimodal large language models (MLLMs) like OpenAI's GPT-4o to use visual intermediate steps while thinking. They call this method "Whiteboard-of-Thought" (WoT), a nod to the widely used "Chain-of-Thought" (CoT) method.

While CoT prompts language models to write out intermediate steps in reasoning, WoT provides MLLMs with a metaphorical "whiteboard" where they can record the results of intermediate thinking steps as images.

To achieve this, the researchers leverage the models' ability to write code with visualization libraries like Turtle and Matplotlib. The generated code is executed to produce an image. This image is then fed back as visual input to the multimodal model to perform further steps to generate a final answer.
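The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the `Model` callable, the `WHITEBOARD_PATH` environment variable, the prompt wording, and the function names are all hypothetical, and running generated code in a subprocess is not a real security sandbox.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional

# Hypothetical model interface (an assumption, not a real API):
# takes a text prompt plus optional image bytes and returns text.
Model = Callable[[str, Optional[bytes]], str]

# What model-generated drawing code might look like (illustrative only,
# never executed by this sketch unless a model returns it).
EXAMPLE_GENERATED_CODE = """
import os
import matplotlib.pyplot as plt
plt.plot([0, 1, 1, 0, 0], [0, 0, 1, 1, 0])
plt.savefig(os.environ["WHITEBOARD_PATH"])
"""


def render_whiteboard(code: str, timeout: float = 30.0) -> bytes:
    """Run generated drawing code in a subprocess and return the saved image bytes."""
    with tempfile.TemporaryDirectory() as workdir:
        out_path = os.path.join(workdir, "whiteboard.png")
        # MPLBACKEND=Agg lets Matplotlib render without a display.
        env = dict(os.environ, WHITEBOARD_PATH=out_path, MPLBACKEND="Agg")
        subprocess.run(
            [sys.executable, "-c", code],
            env=env, cwd=workdir, check=True, timeout=timeout,
        )
        with open(out_path, "rb") as f:
            return f.read()


def whiteboard_of_thought(model: Model, question: str) -> str:
    """Step 1: ask for drawing code. Step 2: render it. Step 3: answer from the image."""
    draw_prompt = (
        "Write Python code that draws a visualization helpful for answering "
        "the question below and saves it to the path in the WHITEBOARD_PATH "
        "environment variable. Return only code.\n\n" + question
    )
    code = model(draw_prompt, None)      # text-only call, returns code
    image = render_whiteboard(code)      # execute the code to get an image
    answer_prompt = "Using the attached image, answer: " + question
    return model(answer_prompt, image)   # multimodal call with the image
```

To try it, pass in any function that matches the `Model` signature and wraps a multimodal model of your choice; the first call receives no image, the second receives the rendered whiteboard.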


Whiteboard-of-Thought brings performance leaps in visual benchmarks

The researchers demonstrate the potential of this idea on three BIG-Bench tasks involving understanding ASCII art, as well as on a recently published difficult benchmark for evaluating spatial reasoning skills.

On these tasks, which have proven challenging for current models, WoT delivers a substantial performance boost and clearly outperforms text-only models.

The authors also conduct a detailed error analysis to understand where the method succeeds and where its limitations lie. They find that a significant portion of the remaining errors can be attributed to visual perception. This suggests the benefits of WoT should grow as models' visual perception improves.

Summary
  • Researchers at Columbia University have developed a technique called "Whiteboard-of-Thought" (WoT) that allows multimodal large language models to use images as intermediate steps in reasoning, improving their performance on tasks that require visual and spatial reasoning.
  • WoT provides models with a metaphorical “whiteboard” on which they can record the results of intermediate reasoning steps as images by generating code with visualization libraries. The generated image is then fed back to the model as visual input to perform further steps to generate a final answer.
  • The researchers demonstrate the potential of WoT with benchmarks involving understanding ASCII art and assessing spatial reasoning skills. WoT delivers substantial performance gains over text-only models on these tasks, with much of the remaining error due to limitations in visual perception.
Sources
Arxiv