Summary

OpenAI's ChatGPT benefits from training with human feedback. Google is now looking at how this method can also improve computer vision models.

The first deep-learning models for object recognition were based on supervised learning with a massive number of labeled images. For example, an image of a cat comes with the digital label "cat". In this way, the system learns the relationship between the word and the image.

With the advent and success of the Transformer architecture in language processing, researchers began to successfully apply Transformers and self-supervised learning to computer vision.

Training exclusively with labeled images became obsolete: like text models, image models could now learn from large amounts of unstructured data. Google's Vision Transformer was one of the first architectures to reach the level of older, supervised models.


Reinforcement learning can improve pre-trained AI models

Following the release of the large language model GPT-2, OpenAI began to experiment with training language models using reinforcement learning with human feedback (RLHF). This is because large, pre-trained language models, while extremely versatile, are difficult to control - a fact that Microsoft is currently experiencing with its Bing chatbot.

RLHF, on the other hand, uses reward signals to teach a large language model which text generations are desirable and which are incorrect or undesirable. As ChatGPT shows, this results in a more aligned model and appears to have a positive effect on the overall performance of the system.
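At the heart of RLHF is a reward model fit to human preference judgments, so that preferred outputs score higher than rejected ones. A minimal sketch of that idea, using the standard pairwise (Bradley-Terry) logistic loss on toy data — all names, shapes, and the linear reward model are illustrative assumptions, not OpenAI's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature vectors for (human-preferred, rejected) answer pairs.
preferred = rng.normal(loc=1.0, size=(64, 4))
rejected = rng.normal(loc=-1.0, size=(64, 4))

w = np.zeros(4)  # illustrative linear reward model: r(x) = w @ x

lr = 0.1
for _ in range(100):
    margin = preferred @ w - rejected @ w      # r(preferred) - r(rejected)
    s = 1.0 / (1.0 + np.exp(-margin))          # model's P(preferred wins)
    # Gradient of the pairwise loss -log sigmoid(margin), batch-averaged
    grad = ((s - 1.0)[:, None] * (preferred - rejected)).mean(axis=0)
    w -= lr * grad

# After fitting, preferred answers outscore rejected ones on average.
assert (preferred @ w).mean() > (rejected @ w).mean()
```

The fitted reward model then supplies the scalar reward signal that reinforcement learning optimizes the language model against.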

Google researchers have now taken this finding and tested whether computer vision models can also benefit from reinforcement learning (RL) with reward signals. The team trained several vision transformer models and then used a simple reinforcement learning algorithm to optimize them for specific tasks such as object recognition, panoptic segmentation, and image colorization.
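The basic recipe can be sketched with a REINFORCE-style update: sample a prediction from the pretrained model, score it with a task reward, and nudge the model toward higher-reward outputs. The toy "model" below is just a vector of logits over three candidate labels, and the reward function is an assumption for illustration — this is the general technique, not Google's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained" model: logits over 3 candidate labels for one input.
logits = np.array([0.5, 0.2, -0.1])


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def reward(label):
    # Illustrative task reward: assume label 2 is the desired output.
    return 1.0 if label == 2 else 0.0


lr = 0.5
for step in range(200):
    p = softmax(logits)
    a = rng.choice(3, p=p)           # sample a prediction from the model
    r = reward(a)                    # score it with the task reward
    # REINFORCE: grad of log p(a) w.r.t. the logits is one_hot(a) - p
    grad = (np.eye(3)[a] - p) * r
    logits += lr * grad              # ascend the expected reward

assert softmax(logits).argmax() == 2  # the model now prefers the rewarded label
```

In the real setting the sampled outputs are detection boxes, segmentation maps, or colorized images, and the reward is a task metric rather than a hand-written check.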

Computer vision models also benefit from reward signals

The team shows that the object recognition and panoptic segmentation models improved with RL are on par with models specialized for these tasks. The colorization model also performs better thanks to RL.

For the three computer vision applications tested, the team shows significant improvements achieved through reinforcement learning. | Image: Pinto, Kolesnikov et al.

As a proof-of-concept study, Google's work shows that finetuning computer vision models via reinforcement learning works much as it does in natural language processing, and can lead to better models.


The next step would be to combine these reward signals with human feedback - as is the case with ChatGPT. The researchers see this as a promising area of research and would like to apply RLHF in computer vision to challenging tasks such as tuning scene understanding outputs for robot grasping. Here, RLHF could enable better perception models and raise the probability of a successful grasp.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Google researchers show that reinforcement learning can be successfully used to finetune computer vision models.
  • The team believes that RL training with human feedback, which has been so successful with ChatGPT, also holds promise for computer vision.
  • The methods could, for example, improve a robot's understanding of a scene, enabling it to grasp objects more successfully.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.