Researchers have built an AI model for cardiac ultrasound based on Meta's JEPA architecture that outperforms common approaches like Masked Autoencoders and contrastive learning in their benchmarks.

Most AI models for image and video analysis either reconstruct masked pixels or learn by matching image-text pairs. Both approaches dominate computer vision. An international research team from the University of Toronto, the Vector Institute, and the University of Chicago now claims a third method can beat both: the JEPA architecture proposed by Yann LeCun and his team during his time at Meta.

Their model, EchoJEPA, was trained on 18 million ultrasound videos from 300,000 patients, according to the paper.

Standard approaches like Masked Autoencoders hide parts of an image and force the model to reconstruct the missing pixels as faithfully as possible. The model has to learn exactly what the image looks like, including all noise and artifacts. JEPA takes a different approach: it also masks parts of the image, but instead of reconstructing the actual pixels, it predicts an abstract representation of the hidden region, essentially a compressed summary of what is there conceptually. The model doesn't need to know exactly what a patch looks like, just what it means.
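The difference between the two training targets can be made concrete with a toy sketch. This is not EchoJEPA's code: the random image, the patch size, the fixed linear projection standing in for a target encoder, and the function names are all assumptions chosen for illustration. It only shows where each loss is measured: in pixel space for a Masked Autoencoder, in representation space for a JEPA-style model.

```python
import numpy as np


def split_into_patches(image, patch):
    """Cut a 2D image into non-overlapping square patches, one flattened patch per row."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    return (
        image[: rows * patch, : cols * patch]
        .reshape(rows, patch, cols, patch)
        .transpose(0, 2, 1, 3)
        .reshape(rows * cols, patch * patch)
    )


def mae_style_loss(predicted_pixels, patches, masked):
    """Masked-autoencoder target: mean squared error on the raw pixels of hidden patches."""
    return float(np.mean((predicted_pixels[masked] - patches[masked]) ** 2))


def jepa_style_loss(predicted_latents, target_latents, masked):
    """JEPA-style target: mean squared error on the embeddings of hidden patches."""
    return float(np.mean((predicted_latents[masked] - target_latents[masked]) ** 2))


rng = np.random.default_rng(0)
image = rng.random((16, 16))            # stand-in for one ultrasound frame
patches = split_into_patches(image, 4)  # 16 patches with 16 pixels each

# Hide three patches from the model.
masked = np.zeros(len(patches), dtype=bool)
masked[[2, 5, 11]] = True

# Hypothetical target encoder: a fixed linear map from 16 pixels to 8 features.
# A real model would use a learned network here.
target_encoder = rng.normal(size=(patches.shape[1], 8))
target_latents = patches @ target_encoder

# All-zero predictions stand in for an untrained predictor.
pixel_loss = mae_style_loss(np.zeros_like(patches), patches, masked)
latent_loss = jepa_style_loss(np.zeros_like(target_latents), target_latents, masked)
```

In the pixel version, every value of a hidden patch counts toward the error, noise included. In the latent version, only what the target encoder keeps in its 8 features counts.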

Ultrasound is a stress test for vision models

Ultrasound images are full of noise. Speckle patterns, shadows, and intensity fluctuations obscure the actual cardiac anatomy. A model that has to reconstruct pixels inevitably learns this noise as well. JEPA is designed to sidestep this problem because, according to the researchers, it focuses on temporally stable structures like heart chambers and wall motion.
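A small simulation illustrates why a pixel-level target carries the noise with it. Everything here is an assumption made for illustration: speckle is modeled as multiplicative gamma noise with mean 1, the "anatomy" is a synthetic ring, and a block average serves as a crude stand-in for an abstract representation. It is not how JEPA computes its representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed structure: a bright ring on a dark background (stand-in for a chamber wall).
yy, xx = np.mgrid[:64, :64]
radius = np.hypot(yy - 32, xx - 32)
structure = np.where((radius > 14) & (radius < 20), 1.0, 0.2)


def speckled_frame(structure, rng):
    """Return the structure with fresh multiplicative noise (assumed speckle model)."""
    noise = rng.gamma(shape=4.0, scale=0.25, size=structure.shape)  # mean 1
    return structure * noise


def coarse_summary(frame, block=8):
    """Average over block x block tiles: a crude, hypothetical 'representation'."""
    h, w = frame.shape
    return frame.reshape(h // block, block, w // block, block).mean(axis=(1, 3))


# Two frames of the same anatomy, differing only in their noise.
frame_a = speckled_frame(structure, rng)
frame_b = speckled_frame(structure, rng)

# Difference measured pixel by pixel versus on the coarse summaries.
pixel_gap = float(np.mean((frame_a - frame_b) ** 2))
summary_gap = float(np.mean((coarse_summary(frame_a) - coarse_summary(frame_b)) ** 2))

print(f"pixel-level gap:   {pixel_gap:.4f}")
print(f"summary-level gap: {summary_gap:.4f}")
```

Although both frames show identical anatomy, they differ substantially pixel by pixel, so a model scored on exact pixels is penalized for not matching the noise. Measured on the coarse summaries, the two frames are much closer.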