Video-ChatGPT can describe video over time and handle text-based tasks such as describing safety risks in a scene, highlighting humorous aspects, or generating matching ad copy.

While companies like Runway ML are making strides in converting text to video, Video-ChatGPT goes the other way, giving a language model the ability to analyze video. Video-ChatGPT can describe the content of a video in text, for example, explaining why a clip might be funny by highlighting an unusual element.

The developers demonstrate this with a video of a giraffe jumping into the water from a diving board. "This is not a common sight, as giraffes are not known for their acrobatic skills or diving abilities," Video-ChatGPT points out.

Image: Maaz et al.

Pre-trained video encoder linked to an open-source language model

The researchers describe the design of Video-ChatGPT as simple and easily scalable. It uses a pre-trained video encoder and combines it with a pre-trained and then fine-tuned language model.

Despite its name, the project from the Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi does not use OpenAI technology. Instead, it builds on the open-source Vicuna-7B model. The researchers added a linear layer that connects the output of the video encoder to the language model.
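The article does not spell out the encoder, pooling scheme, or layer sizes, but the general pattern of bridging a frozen vision encoder and a language model with a single trainable linear projection can be sketched roughly as follows (a minimal PyTorch illustration; the class name, pooling choices, and dimensions are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    """Toy adapter: pool per-frame features from a frozen video encoder and
    project them into the language model's embedding space with one linear
    layer. All names and dimensions here are placeholders."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the trainable "glue" layer

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (num_frames, num_patches, vision_dim),
        # produced by a frozen pre-trained visual encoder on sampled frames
        temporal = frame_features.mean(dim=0)  # average over time  -> (num_patches, vision_dim)
        spatial = frame_features.mean(dim=1)   # average over space -> (num_frames, vision_dim)
        video_tokens = torch.cat([temporal, spatial], dim=0)
        # the projected vectors are prepended to the prompt's token embeddings
        # before the language model generates its answer
        return self.proj(video_tokens)
```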

In addition to the user prompt that asks for a specific task, the language model is also given a system command that defines its role and general task.
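The exact wording of that system command is not quoted in the article; a hypothetical sketch of how such a two-part prompt might be assembled (the placeholder token and instruction text are assumptions for illustration) could look like this:

```python
def build_prompt(user_question: str, video_placeholder: str = "<video>") -> str:
    """Hypothetical prompt assembly: a fixed system instruction defining the
    assistant's role, a placeholder where the projected video features are
    spliced in, and the user's task-specific question."""
    system = ("You are a helpful assistant that can understand video content. "
              "Answer the user's question based on the video.")
    return f"{system}\n{video_placeholder}\n{user_question}"

print(build_prompt("Why might this clip be funny?"))
```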

Image: Maaz et al.

Human and machine-enhanced dataset

The researchers used a mix of human annotation and semi-automated methods to generate high-quality data for fine-tuning the Vicuna model. The data ranges from detailed descriptions to creative tasks and interviews and covers a variety of concepts.

In total, the dataset contains approximately 86,000 high-quality question-answer pairs, some annotated by humans, some annotated by GPT models, and some annotated with context from image analysis systems.
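The published format of those question-answer pairs is not reproduced in the article; one plausible record layout, purely for illustration, might look like this (field names and values are assumptions, not the dataset's actual schema):

```python
# Illustrative instruction-tuning example for video question answering.
example = {
    "video_id": "v_example_001",  # placeholder identifier
    "question": "What safety risks are visible in this scene?",
    "answer": "A worker stands on a ladder without a harness near the edge ...",
    "source": "human_annotation",  # or "gpt_assisted" / "image_context_model"
}
```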

Image: Maaz et al.

The heart of Video-ChatGPT is its ability to combine video understanding with text generation. It has been extensively tested for video reasoning, creativity, and spatial and temporal understanding. See more examples in the video below and in the GitHub repository.

 

For now, Video-ChatGPT is only available as an online demo, but the developers plan to release code and models on GitHub in the near future.

Multimodal AI future

After significant recent advances in text generation, companies like OpenAI and Google are turning to multimodal models. Bard can understand and respond to images, and GPT-4 demonstrated image understanding at its official launch, although OpenAI has not yet made this capability generally available.


Going from images to moving images would be the next logical step. Google has already announced the development of a large multimodal AI model with Project Gemini, to be released later this year.

Summary
  • Researchers have presented Video-ChatGPT, a model that can understand and explain video content.
  • It combines a pre-trained video encoder with the open-source Vicuna-7B language model.
  • The tool can currently be tested in the browser. A release of the code is planned.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.