ChatGPT's training method could advance computer vision

OpenAI's ChatGPT benefits from training with human feedback. Google is now looking at how this method can also improve computer vision models.

The first deep-learning models for object recognition were based on supervised learning with a massive number of labeled images. For example, an image of a cat comes with the digital label "cat". In this way, the system learns the relationship between the word and the image.

With the advent and success of the Transformer architecture in language processing, researchers began to successfully apply Transformers and self-supervised learning to computer vision.

Training exclusively on labeled images became obsolete: like text models, image models learned from large amounts of unstructured data. Google's Vision Transformer was one of the first architectures to reach the level of older, supervised models.

Reinforcement learning can improve pre-trained AI models

Following the release of the large language model GPT-2, OpenAI began to experiment with training language models using reinforcement learning from human feedback (RLHF). The reason is that large, pre-trained language models, while extremely versatile, are difficult to control - a fact that Microsoft is currently experiencing with its Bing chatbot.

RLHF addresses this by using reward signals to teach a large language model which text generations are desirable and which are incorrect or undesirable. As ChatGPT shows, this results in a more aligned model and appears to have a positive effect on the overall performance of the system.

Google researchers have now taken this finding and tested whether computer vision models can also benefit from reinforcement learning (RL) with reward signals. The team trained several vision transformer models and then used a simple reinforcement learning algorithm to optimize them for specific tasks such as object recognition, panoptic segmentation or image colorization.
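
To make the idea of tuning a pre-trained model with a reward signal more concrete, here is a minimal sketch. It is not the researchers' code: the article only mentions "a simple reinforcement learning algorithm", so the use of REINFORCE with a mean-reward baseline is an assumption, and the toy "model" (four logits standing in for a pre-trained network), the reward function and all names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_tune(logits, reward_fn, steps=500, lr=0.5, batch=16, seed=0):
    """Tune a categorical policy with REINFORCE and a mean-reward baseline.

    Assumption: this toy policy stands in for a pre-trained vision model;
    a real setup would update network weights, not raw logits.
    """
    rng = np.random.default_rng(seed)
    logits = logits.astype(float).copy()
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of outputs from the current model.
        samples = rng.choice(len(probs), size=batch, p=probs)
        rewards = np.array([reward_fn(s) for s in samples], dtype=float)
        # Subtracting the batch mean reduces the variance of the update.
        advantages = rewards - rewards.mean()
        grad = np.zeros_like(logits)
        for s, a in zip(samples, advantages):
            # Gradient of log softmax w.r.t. the logits: one_hot(s) - probs.
            one_hot = np.zeros_like(logits)
            one_hot[s] = 1.0
            grad += a * (one_hot - probs)
        logits += lr * grad / batch
    return logits

# Hypothetical "pre-trained" model that prefers output 0,
# while the (hypothetical) task reward prefers output 2.
pretrained_logits = np.array([2.0, 0.5, 1.0, 0.0])

def reward(output):
    return 1.0 if output == 2 else 0.0

tuned_logits = reinforce_tune(pretrained_logits, reward)
p_before = softmax(pretrained_logits)[2]
p_after = softmax(tuned_logits)[2]
print(f"P(rewarded output) before: {p_before:.3f}, after: {p_after:.3f}")
```

In this sketch the reward is a fixed function. In a vision setting it would be a task metric computed on the model's sampled output; which metric to use is a design choice that the article does not specify.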

Computer vision models also benefit from reward signals

The team shows that the object recognition and panoptic segmentation models improved with RL are on par with models specialized for these tasks. The colorization model also performs better thanks to RL.

For the three computer vision applications tested, the team shows significant improvements achieved through reinforcement learning. | Image: Pinto, Kolesnikov et al.

As a proof-of-concept study, Google's work shows that fine-tuning models via reinforcement learning works in computer vision much as it does in natural language processing, and can lead to better models.

The next step would be to combine these reward signals with human feedback - as is the case with ChatGPT. The researchers see this as a promising area of research and would like to apply RLHF in computer vision to challenging tasks such as tuning scene understanding outputs for robot grasping. Here, RLHF could enable better perception models and raise the probability of a successful grasp.
