
Meta has released SAM 2, a new foundation model for image and video segmentation. The company is open-sourcing the model, code, and dataset.


In April 2023, Meta introduced its "Segment Anything Model" (SAM), a foundation model for image segmentation that was hailed as the "GPT-3 moment" for computer vision due to its significant quality improvements.

Now Meta is following up with SAM 2. While SAM was trained on 11 million images and optimized for image segmentation, SAM 2 was trained on video data and can segment both images and video. According to Meta, video segmentation works reliably even with lower quality video or when objects are partially obscured.

Occlusion has long been a challenging problem in machine vision, and SAM 2 now appears to handle it largely reliably. In this example, only the part of the boy visible behind the tree is tracked. | Video: Meta


SAM 2 was trained on the new SA-V (Segment Anything Video) dataset, the largest publicly available video segmentation dataset to date. Meta says SA-V contains 50,900 videos with 642,600 mask annotations (object tracks across frames), comprising 35.5 million individual masks - 53 times more than in previous datasets. With nearly 200 hours of annotated video, SA-V sets a new benchmark for training data.

Both SAM generations helped create the dataset: Meta used a "data engine" annotation setup with human annotators that, thanks to the SAM model in the loop, can label videos up to 8.4 times faster than comparable approaches. The annotators used SAM 2 interactively to label video segments, and the newly annotated data was in turn used to update SAM 2.

Architecturally, SAM 2 builds on its Transformer-based predecessor. A new feature is a memory module that stores information about objects and previous interactions across video frames. This allows SAM 2 to track objects over longer sequences and respond to user input. When applied to images, the memory is empty and the model behaves like SAM.
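To make the idea concrete, here is a minimal, purely illustrative Python sketch of such a streaming memory loop. None of the names below come from Meta's code; it only shows the pattern described above: encode a frame, condition the prediction on memories of earlier frames, store a new memory, and fall back to single-image behavior when the memory is empty.

```python
# Illustrative sketch only - not the SAM 2 API. All names are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List

import numpy as np


@dataclass
class MemoryBank:
    """Stores encoded object features from previously processed frames."""
    entries: List[np.ndarray] = field(default_factory=list)
    max_size: int = 7  # keep only the most recent frames' memories

    def add(self, features: np.ndarray) -> None:
        self.entries.append(features)
        if len(self.entries) > self.max_size:
            self.entries.pop(0)

    def context(self) -> np.ndarray:
        # Empty memory: the predictor behaves like single-image segmentation.
        if not self.entries:
            return np.zeros((0,))
        return np.concatenate(self.entries)


class StreamingSegmenter:
    """Per-frame loop: encode frame, attend to memory, predict mask, store memory."""

    def __init__(
        self,
        encode: Callable[[np.ndarray], np.ndarray],      # frame -> features
        decode: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (features, memory) -> mask
        memorize: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (features, mask) -> memory entry
    ):
        self.encode = encode
        self.decode = decode
        self.memorize = memorize
        self.memory = MemoryBank()

    def segment_frame(self, frame: np.ndarray) -> np.ndarray:
        features = self.encode(frame)
        mask = self.decode(features, self.memory.context())
        self.memory.add(self.memorize(features, mask))
        return mask
```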

Video: Meta

SAM 2 sets new standards in video and image segmentation

In experiments, SAM 2 demonstrated better segmentation accuracy with three times fewer interactions compared to previous approaches. Meta says the new model outperforms the current state-of-the-art on established video object segmentation benchmarks. SAM 2 also achieved better results in image segmentation than the original SAM model, at six times the speed.


According to Meta, SAM 2 shows little variation in performance across perceived skin tone and age, with only minimal differences between genders. Inference runs at 44 frames per second, which is close to real-time.

SAM 2 isn't perfect. It can lose objects after scene cuts or long occlusions, struggle to segment very fine details, and have trouble tracking individual objects within groups of similar, moving objects. Explicit modeling of motion could help address these issues, the researchers note.

If there are many similar, moving objects in the frame, SAM 2's segmentation and tracking can slip. | Video: Meta

The researchers see SAM 2 as an important advance in machine vision that could drive further research and applications, such as robots that can move and interact with the real world more reliably, or enabling video effects in generative AI video models.


Meta releases the SAM 2 model, code, and weights as open source under the Apache 2.0 license. The SA-V dataset is released under a CC BY 4.0 license. SAM 2 can be tested in an interactive demo.
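For developers, a prompted image segmentation call might look like the sketch below. It is based on Meta's public segment-anything-2 repository; the checkpoint and config filenames are assumptions and may differ from the current release, so check the repository before running.

```python
# Hedged example based on Meta's public "segment-anything-2" repository.
# The checkpoint and config filenames below are assumptions - verify them
# against the repository's current README before running.
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # assumed checkpoint path
model_cfg = "sam2_hiera_l.yaml"                 # assumed config name

# Build the model and wrap it in the image predictor.
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("example.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with a single foreground click at (x, y); label 1 = foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```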

Summary
  • Meta releases its new open-source model SAM 2, which, unlike its predecessor SAM, can also segment videos and achieves better results in image segmentation.
  • SAM 2 was trained on the largest publicly available video segmentation dataset to date, SA-V, which contains 50,900 videos with a total of 642,600 mask annotations and was created using a fast annotation system and the SAM models.
  • In experiments, SAM 2 showed better segmentation accuracy with 3x fewer interactions than previous approaches and outperformed the current state of the art on established benchmarks. It still has limitations in fine detail and object tracking in complex scenes.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.