How one simple metric could change computer vision forever


AI models break benchmark after benchmark in computer vision - but in the real world, they continue to show weaknesses and lag behind humans. Why is that?

MIT researchers show that current training datasets often use images that depict objects so clearly that humans - and machines - can easily recognize them. But what constitutes a "difficult" image? The researchers suggest using the time it takes a human to identify an object in an image as a measure.

The Minimum Viewing Time (MVT) metric developed by the team quantifies how difficult an image is to recognize. The researchers showed participants images from a subset of the ImageNet and ObjectNet datasets for durations ranging from 17 milliseconds to 10 seconds and asked them to select the correct object from 50 options. After more than 200,000 trials, they found that the test sets were biased toward easier images with short MVTs, so that most of the benchmark performance came from images that were easy for humans to recognize.
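To make the idea concrete, here is a minimal Python sketch of how a per-image MVT could be derived from trial data. It is an illustration, not the researchers' code: the function name `minimum_viewing_time`, the majority threshold of 0.5, the intermediate 150 ms duration, and the toy trials are all assumptions. Only the 17 ms to 10 s range comes from the study as described above.

```python
from collections import defaultdict

def minimum_viewing_time(trials, threshold=0.5):
    """Map each image to the shortest duration (ms) at which the share of
    correct answers exceeds `threshold`, or None if that never happens.

    `threshold` is an assumption of this sketch, not a value from the study.
    """
    # tallies[image_id][duration_ms] = [number correct, number of trials]
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for image_id, duration_ms, correct in trials:
        cell = tallies[image_id][duration_ms]
        cell[0] += int(correct)
        cell[1] += 1

    mvt = {}
    for image_id, by_duration in tallies.items():
        mvt[image_id] = None
        # Walk the durations from shortest to longest and stop at the first
        # one where participants were mostly right.
        for duration_ms in sorted(by_duration):
            n_correct, n_total = by_duration[duration_ms]
            if n_correct / n_total > threshold:
                mvt[image_id] = duration_ms
                break
    return mvt

# Toy trials: (image_id, duration_ms, answered_correctly). Invented values.
trials = [
    ("mug_clear", 17, True), ("mug_clear", 17, True), ("mug_clear", 17, False),
    ("mug_occluded", 17, False), ("mug_occluded", 17, False),
    ("mug_occluded", 150, False), ("mug_occluded", 150, True),
    ("mug_occluded", 10000, True), ("mug_occluded", 10000, True),
]
mvt = minimum_viewing_time(trials)
print(mvt)  # {'mug_clear': 17, 'mug_occluded': 10000}
```

In this toy example the clearly visible mug is recognized at the shortest exposure, while the occluded one only crosses the threshold at 10 seconds, which would mark it as a hard image.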

The team also showed that larger models like the Vision Transformer improved considerably over smaller models on simpler images, but made less progress on more difficult images.
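A comparison like this depends on reporting accuracy separately for each difficulty level instead of as a single benchmark score. The following Python sketch shows one way to do that. The function name `accuracy_by_difficulty`, the image IDs, and all values are hypothetical and do not reproduce any result from the study.

```python
from collections import defaultdict

def accuracy_by_difficulty(mvt_ms, is_correct):
    """Group a model's per-image results by the image's MVT (in ms) and
    return the accuracy within each group, ordered from easy to hard."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for image_id, duration_ms in mvt_ms.items():
        if image_id not in is_correct:
            continue  # skip images the model was not evaluated on
        totals[duration_ms] += 1
        hits[duration_ms] += int(is_correct[image_id])
    return {d: hits[d] / totals[d] for d in sorted(totals)}

# Invented example: MVT per image and whether a model classified it correctly.
mvt_ms = {"img_a": 17, "img_b": 17, "img_c": 10000, "img_d": 10000}
model_correct = {"img_a": True, "img_b": True, "img_c": True, "img_d": False}

per_bin = accuracy_by_difficulty(mvt_ms, model_correct)
print(per_bin)  # {17: 1.0, 10000: 0.5}
```

Running the same function for a smaller and a larger model would show whether the gains are concentrated in the short-MVT groups.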

"Minimum viewing time" could enable more robust AI models

Co-author Jesse Cummings emphasizes the importance of MVT for evaluating AI models: "We want models that are able to recognize any image even if — perhaps especially if — it’s hard for a human to recognize. We’re the first to quantify what this would mean."

Co-author David Mayo and his team are also investigating the neurological basis of visual recognition, looking at whether the brain shows different activity when processing simple and difficult images.

"This comprehensive approach addresses the long-standing challenge of objectively assessing progress towards human-level performance in object recognition and opens new avenues for understanding and advancing the field," says Mayo. Using MVT as a measure of difficulty across many different computer vision tasks could pave the way for more robust and human-like performance in object recognition.

