Researchers are investigating whether multimodality makes AI models more powerful, even when the data is not directly related.
Multimodal AI models, such as Google's Gemini, can process text, images, and sound. Such models are often trained on paired data from different modalities, such as images and their text descriptions. The Multimodal Pathway approach instead focuses on scenarios where the datasets come from different modalities but have no direct relation to each other.
Multimodal Pathway Transformer finds positive effect
The team from the Chinese University of Hong Kong and the Tencent AI Lab specifically investigated whether the performance of an AI model for one modality, such as image recognition, improves when data from another, unrelated modality, such as audio or point clouds, is also used.
For this purpose, the researchers developed the Multimodal Pathway Transformer (M2PT). In M2PT, a tokenizer and a head specific to the target modality are linked to the transformer blocks of an auxiliary model that was trained on data from another modality. The link is made through "cross-modal re-parameterization".
Cross-modal re-parameterization links each linear layer in the transformer blocks of the target model to its counterpart in the auxiliary model and adds the outputs of the two layers together. Because both layers are linear and receive the same input, their weights can be merged into a single layer after training. This is why the approach incurs little additional training cost and no additional inference cost, which makes it attractive for practical application.
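The idea of adding the outputs of two linked linear layers can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the helper names (linear, two_branch_forward, merge_weights), the toy dimensions, the random weights, and the scalar scale on the auxiliary branch are all assumptions made for illustration. With scale set to 1.0, the sketch reproduces the plain sum of both outputs. It shows that the two-branch computation gives the same result as a single layer with merged weights, which is consistent with the claim of no additional inference cost.

```python
import random


def linear(x, weight, bias):
    """Compute y = x @ weight + bias for one input vector x."""
    return [
        sum(x_i * weight[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def two_branch_forward(x, w_target, b_target, w_aux, scale):
    """Training-time view: target output plus (scaled) auxiliary output."""
    y_target = linear(x, w_target, b_target)
    # Assumption: the auxiliary branch contributes only its weight matrix.
    y_aux = linear(x, w_aux, [0.0] * len(b_target))
    return [t + scale * a for t, a in zip(y_target, y_aux)]


def merge_weights(w_target, w_aux, scale):
    """Inference-time view: fold the auxiliary weights into the target weights."""
    return [
        [wt + scale * wa for wt, wa in zip(row_t, row_a)]
        for row_t, row_a in zip(w_target, w_aux)
    ]


# Toy setup with hypothetical sizes and random weights.
rng = random.Random(0)
d_in, d_out = 4, 3
x = [rng.uniform(-1, 1) for _ in range(d_in)]
w_target = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
w_aux = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
b_target = [rng.uniform(-1, 1) for _ in range(d_out)]
scale = 1.0  # assumption: 1.0 means the two outputs are simply added

y_two_branch = two_branch_forward(x, w_target, b_target, w_aux, scale)
y_merged = linear(x, merge_weights(w_target, w_aux, scale), b_target)

# Largest element-wise difference between the two computations.
max_diff = max(abs(a - b) for a, b in zip(y_two_branch, y_merged))
print("outputs match:", max_diff < 1e-9)
```

Because a linear layer distributes over addition, the merged layer has the same shape as the original target layer, so the auxiliary weights do not need to be kept as a separate branch at inference time.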
Applying the Multimodal Pathway approach led to significant and consistent performance improvements: in the developers' experiments, recognition improved for images, point clouds, video, and audio.
AI model benefits from complementary knowledge
Why does it work? The researchers suggest that a model trained on data from one modality has encoded knowledge that can help another model process input sequences from a different modality. This "modality-complementary knowledge" seems to exist and to be transferable, the team says, even if the data from the two modalities is unrelated.
Nevertheless, a theoretical explanation for the observed improvements is still missing. Finding one could lead to a deeper understanding of the mechanism and of neural networks in general and is, according to the team, a topic for future research.