
To mark the 10th anniversary of Meta's Fundamental AI Research (FAIR) team, the company presents three new research projects: Ego-Exo4D, Seamless Communication, and Audiobox.

Ego-Exo4D is a dataset and benchmark suite to support AI research in video learning and multimodal perception. Collected over two years by Meta's FAIR team, Project Aria, and 15 university partners from around the world, Ego-Exo4D captures both "egocentric" views from the camera of a participant wearing Project Aria glasses and "exocentric" views from surrounding cameras.

The dataset focuses on complex human activities such as sports, music, cooking, dancing, and bicycle repair.

Video: Meta


Meta sees applications in augmented reality (AR), where a person wearing smart glasses could quickly learn new skills with the help of a virtual AI trainer guiding them through an instructional video; in robot learning, where a robot observing the people around it could acquire new manipulation skills with less physical trial and error; and in social networks, where new communities could form around people sharing their knowledge and complementary skills in videos.

The dataset of over 1,400 hours of video will be available as open source in December, and a public benchmark competition for Ego-Exo4D is planned for next year.

Seamless Communication aims to enable expressive and fast AI translations

After presenting the SeamlessM4T multimodal translation model in August, the Seamless Communication project at FAIR is now introducing a family of AI research models that build on it to enable more natural and authentic communication across language barriers.

The project consists of four models:

- SeamlessExpressive: Preserves the expression and nuance of speech across language boundaries.
- SeamlessStreaming: Delivers speech and text translations with a latency of approximately two seconds.
- SeamlessM4T v2: A multilingual, multitask model for effortless voice and text communication.
- Seamless: Combines the capabilities of SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2 in a single model.


Video: Meta

Meta also published a demo of SeamlessExpressive that lets you hear your own voice translated.

Audiobox is a generative AI model for audio

Audiobox is Meta's new audio generation model. It is capable of generating voices and sound effects through a combination of voice input and natural language text prompts, making it easier to create custom audio files for different use cases.

Compared to its direct predecessor, Voicebox, Audiobox offers improved controllability by allowing users to use natural language prompts to create a desired sound or type of speech.


Video: Meta

The model will initially be made available to a select group of researchers and academic institutions to advance the state of the art in audio generation research and to ensure the responsible development of AI, Meta said.

Summary
  • Meta presents three new AI research projects: Ego-Exo4D, a dataset to support video learning and multimodal perception research; Seamless Communication, a family of AI models to enhance natural and authentic communication across language boundaries; and Audiobox, a generative AI model for creating voices and sound effects.
  • Ego-Exo4D focuses on complex human activities and has applications in augmented reality, robotics and social networking. The dataset will be available as open source in December, and a benchmark competition is planned for next year.
  • Seamless Communication consists of four models that enable expression, nuance, and rapid translation across language boundaries, while Audiobox allows users to create custom audio files with natural language text prompts.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.