AI research

Video-ChatGPT analyzes videos and explains why they might be funny

Jonathan Kemper
AI-generated illustration of a living room with armchair and lamp and a tube TV with the neck of an oversized giraffe sticking out of it.

Midjourney prompted by THE DECODER

Video-ChatGPT can describe video content over time and solve text-based tasks such as pointing out safety risks in a scene, highlighting humorous aspects, or generating matching ad copy.

While companies like Runway ML are making strides in converting text to video, Video-ChatGPT goes the other way, giving a language model the ability to analyze video. Video-ChatGPT can describe the content of a video in text, for example, explaining why a clip might be funny by highlighting an unusual element.

The developers demonstrate this with a video of a giraffe jumping into the water from a diving board. "This is not a common sight, as giraffes are not known for their acrobatic skills or diving abilities," Video-ChatGPT points out.

Image: Maaz et al.

Pre-trained video encoder linked to an open-source language model

The researchers describe the design of Video-ChatGPT as simple and easily scalable. It uses a pre-trained video encoder and combines it with a pre-trained and then fine-tuned language model.

Despite its name, the project from the Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi does not use OpenAI technology. Instead, it builds on the open-source Vicuna-7B model. The researchers added a linear layer that connects the video encoder's output to the language model.
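
To give a rough idea of how such a connection could look in code, the following is a minimal PyTorch sketch, not the authors' implementation; the class name, feature dimensions, and token counts are illustrative assumptions.

import torch
import torch.nn as nn

# Assumed dimensions: the actual sizes depend on the chosen video encoder
# and on the embedding width of the language model (here a 7B-class LLM).
VIDEO_FEATURE_DIM = 1024   # per-token features from a CLIP-style video encoder
LLM_EMBED_DIM = 4096       # token embedding size of the language model

class VideoProjection(nn.Module):
    """Maps video features into the language model's token embedding space."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # video_features: (batch, num_video_tokens, in_dim), where the tokens
        # are pooled frame features from the pre-trained video encoder.
        return self.proj(video_features)

# Example: project 100 pooled video tokens so they can be prepended to the
# text token embeddings that the language model consumes.
features = torch.randn(1, 100, VIDEO_FEATURE_DIM)
video_tokens = VideoProjection(VIDEO_FEATURE_DIM, LLM_EMBED_DIM)(features)
print(video_tokens.shape)  # torch.Size([1, 100, 4096])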

In addition to the user prompt, which asks for a specific task, the language model also receives a system command that defines its role and overall job.
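
A rough illustration of how such a system command and user prompt could be combined is sketched below; the prompt wording and the video placeholder are assumptions for illustration, not the exact template used by Video-ChatGPT.

# Hypothetical prompt assembly; the "<video>" marker stands for the position
# where the projected video tokens would be inserted into the input sequence.
SYSTEM_PROMPT = (
    "You are an assistant that watches a video and answers questions about it. "
    "Base your answers only on the visual content."
)

def build_prompt(user_request: str) -> str:
    return f"{SYSTEM_PROMPT}\n<video>\nUSER: {user_request}\nASSISTANT:"

print(build_prompt("Why might this clip be funny?"))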

Image: Maaz et al.

Human and machine-enhanced dataset

The researchers used a mix of human annotation and semi-automated methods to generate high-quality data to fine-tune the Vicuna model. This data ranges from detailed descriptions to creative tasks and interviews and covers a variety of different concepts.

In total, the dataset contains approximately 86,000 high-quality question-answer pairs, some annotated by humans, some annotated by GPT models, and some annotated with context from image analysis systems.
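
To make the structure of such data concrete, a single question-answer record might look like the following; the field names and content are hypothetical and not taken from the released dataset.

# Illustrative example of one fine-tuning record (assumed schema).
qa_pair = {
    "video_id": "example_clip_001",
    "question": "What makes this scene unusual?",
    "answer": "A giraffe climbs onto a diving board and jumps into the pool, "
              "which is not behavior the animal is known for.",
    "annotation_source": "human",  # alternatives might be "gpt" or "image_context"
}
print(qa_pair["question"])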

Image: Maaz et al.

The heart of Video-ChatGPT is its ability to combine video understanding and text generation. It has been extensively tested for its capabilities in video reasoning, creativity, and understanding of time and space. See more examples in the video below and in the GitHub repository.

 

For now, Video-ChatGPT is only available as an online demo, but the developers plan to release code and models on GitHub in the near future.

Multimodal AI future

After significant recent advances in text generation, companies like OpenAI and Google are turning to multimodal models. Google's Bard can understand and respond to images, and GPT-4 demonstrated similar image capabilities at its official launch, although OpenAI has not yet made them publicly available.

Going from images to moving images would be the next logical step. Google has already announced the development of a large multimodal AI model with Project Gemini, to be released later this year.
