
Researchers are investigating whether multimodality makes AI models more powerful, even when the data is not directly related.

Multimodal AI models such as Google's Gemini can process text, images, and sound. Such models are typically trained on paired data from different modalities, for example images and their associated text descriptions. The Multimodal Pathway approach, by contrast, focuses on scenarios where the data sets come from different modalities but have no direct relevance to each other.

Multimodal Pathway Transformer finds positive effect

The team from the Chinese University of Hong Kong and the Tencent AI Lab specifically investigated whether the performance of AI models for one modality, such as image recognition, improves when data from another, seemingly unrelated modality, such as audio or point clouds, is also used.

For this purpose, the researchers developed the Multimodal Pathway Transformer (M2PT), which keeps a tokenizer and a head specific to the target modality, while its transformer blocks are linked via "cross-modal re-parameterization" to the blocks of an auxiliary model trained on data from another modality.
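In rough terms, the resulting model keeps the target modality's own tokenizer and task head, and the auxiliary model's knowledge enters only through the transformer blocks. The following Python sketch illustrates that structure, assuming a PyTorch-style setup; the class and attribute names are illustrative rather than taken from the paper's code, and the augmented linear layers themselves are sketched after the next paragraph.

import torch.nn as nn

class M2PTSketch(nn.Module):
    """Illustrative structure of a Multimodal Pathway Transformer (hypothetical names)."""

    def __init__(self, tokenizer: nn.Module, blocks: nn.ModuleList, head: nn.Module):
        super().__init__()
        self.tokenizer = tokenizer  # target-modality tokenizer, e.g. image patch embedding
        self.blocks = blocks        # transformer blocks whose linear layers also carry
                                    # weights from an auxiliary model of another modality
        self.head = head            # target-modality task head, e.g. a classifier

    def forward(self, x):
        tokens = self.tokenizer(x)  # raw input -> token sequence
        for block in self.blocks:
            tokens = block(tokens)  # standard transformer computation
        return self.head(tokens)    # prediction for the target task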

Cross-modal re-parameterization is a method in which each linear layer in the transformer blocks of the target model is linked to its counterpart in the auxiliary model, and the outputs of both layers are added together. Because this sum is mathematically equivalent to a single linear layer with merged weights, the approach incurs little additional training cost and no additional inference cost, making it attractive for practical use.
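For a single linear layer, the trick can be written out directly: adding the auxiliary layer's output is the same as adding its scaled weight matrix to the target weight, and that merged matrix can be computed once after training. The snippet below is a minimal, hypothetical PyTorch illustration of this idea, not the authors' implementation; the mixing scalar lam and all other names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalLinear(nn.Module):
    """Sketch of cross-modal re-parameterization for one linear layer (hypothetical)."""

    def __init__(self, target: nn.Linear, aux: nn.Linear):
        super().__init__()
        self.target = target  # trainable layer of the target model
        # frozen weights of the counterpart layer in the auxiliary model
        self.register_buffer("aux_weight", aux.weight.detach().clone())
        self.lam = nn.Parameter(torch.zeros(()))  # learned mixing scalar

    def forward(self, x):
        # Adding the two layer outputs equals applying a single merged weight matrix.
        merged = self.target.weight + self.lam * self.aux_weight
        return F.linear(x, merged, self.target.bias)

    def merge(self) -> nn.Linear:
        # After training, fold the auxiliary weight into a plain linear layer,
        # so inference runs at the cost of the original model alone.
        fused = nn.Linear(self.target.in_features, self.target.out_features,
                          bias=self.target.bias is not None)
        with torch.no_grad():
            fused.weight.copy_(self.target.weight + self.lam * self.aux_weight)
            if self.target.bias is not None:
                fused.bias.copy_(self.target.bias)
        return fused

In such a setup, every linear layer inside, say, an image transformer's blocks would be wrapped together with the corresponding layer of an equally sized audio or point cloud transformer, and the wrappers would be replaced by their merged layers before deployment.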

Applying the multimodal pathway approach led to significant and consistent performance gains across modalities: in the researchers' experiments, it improved results in image, point cloud, video, and audio recognition.

AI model benefits from complementary knowledge

Why does it work? The researchers suggest that a model trained on data from one modality encodes knowledge that can also benefit a model whose input sequences come from a different modality. This "modality-complementary knowledge" seems to exist and to be transferable, the team says, even when the data from the two modalities is unrelated.

However, a theoretical explanation for the observed improvements is still missing. Finding one could lead to a deeper understanding of the mechanism, and of neural networks in general, and is, according to the team, a topic for future research.

Summary
  • Researchers from the Chinese University of Hong Kong and Tencent AI Lab investigated whether multimodality can improve the performance of AI models, even when data from different modalities are not directly linked.
  • They developed the Multimodal Pathway Transformer (M2PT), which links the layers of models trained on different modalities via "cross-modal re-parameterization," and showed significant performance improvements in image, point cloud, video, and audio recognition.
  • The researchers hypothesize that the AI model benefits from complementary knowledge encoded in different modalities, even when the data between modalities is irrelevant. However, a theoretical justification for these improvements is still open and subject to future research.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.