Meta has released SAM 2, a new foundation model for image and video segmentation. The company is open-sourcing the model, code, and dataset.
In April 2023, Meta introduced its "Segment Anything Model" (SAM), a foundation model for image segmentation that was hailed as the "GPT-3 moment" for computer vision due to its significant quality improvements.
Now Meta is following up with SAM 2. While SAM was trained on 11 million images and optimized for image segmentation, SAM 2 was trained on video data and can segment both images and video. According to Meta, video segmentation works reliably even with lower-quality video or when objects are partially obscured.
SAM 2 was trained on the new SA-V (Segment Anything Video) dataset, the largest publicly available video segmentation dataset to date. According to Meta, SA-V contains 50,900 videos with 642,600 mask annotations, comprising 35.5 million individual masks, 53 times more than any previous dataset. With nearly 200 hours of annotated video, SA-V sets a new standard for training data.
Both SAM models helped create the dataset: Meta used a "data engine" annotation system in which human annotators work with the SAM model in the loop, labeling videos up to 8.4 times faster than previous approaches. The annotators used SAM 2 interactively to segment objects in videos, and the newly collected data was in turn used to improve SAM 2.
Architecturally, SAM 2 builds on its Transformer-based predecessor. A new feature is a memory module that stores information about objects and previous interactions across video frames. This allows SAM 2 to track objects over longer sequences and respond to user input. When applied to images, the memory is empty and the model behaves like SAM.
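To make the idea concrete, here is a conceptual toy sketch, not Meta's actual code: names like MemoryBank and attend_to_memory are invented for illustration. Each frame's features are conditioned on a small bank of features stored from earlier frames, and an empty bank reduces to plain single-image segmentation, as described above.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Toy store of features from previously processed frames (illustrative only)."""
    def __init__(self, max_size=8):
        self.entries = []
        self.max_size = max_size

    def add(self, features):
        self.entries.append(features.detach())
        self.entries = self.entries[-self.max_size:]   # keep only recent frames

    def as_tensor(self):
        # Empty memory (single image): nothing to condition on, plain SAM-like behavior.
        if not self.entries:
            return None
        return torch.cat(self.entries, dim=0)          # (num_memories * tokens, dim)

def attend_to_memory(frame_feat, memory):
    """Toy cross-attention: mix stored memories into the current frame's features."""
    if memory is None:
        return frame_feat
    attn = F.softmax(frame_feat @ memory.T / frame_feat.shape[-1] ** 0.5, dim=-1)
    return frame_feat + attn @ memory

# Process a short "video" of 4 frames with stand-in features.
bank = MemoryBank()
for t in range(4):
    frame_feat = torch.randn(16, 32)                   # stand-in for image-encoder tokens
    conditioned = attend_to_memory(frame_feat, bank.as_tensor())
    # ...a mask decoder would predict the object mask from `conditioned` here...
    bank.add(conditioned)                              # remember this frame for later ones
```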
SAM 2 sets new standards in video and image segmentation
In experiments, SAM 2 achieved better segmentation accuracy while requiring only a third as many interactions as previous approaches. Meta says the new model outperforms the current state of the art on established video object segmentation benchmarks. SAM 2 also beat the original SAM on image segmentation while running six times faster.
According to Meta, SAM 2 performs consistently across perceived skin tones and age groups, with minimal differences between genders. Inference runs at around 44 frames per second, approaching real-time performance.
SAM 2 isn't perfect. It can lose track of objects after scene cuts or long occlusions, struggle to segment very fine details, and have trouble following individual objects within groups of similar, moving objects. Explicit motion modeling could help address this, the researchers note.
The researchers see SAM 2 as an important advance in machine vision that could drive further research and applications, from robots that navigate and interact with the real world more reliably to new video effects in generative AI video models.
Meta is releasing the SAM 2 model, code, and weights as open source under the Apache 2.0 license, while the SA-V dataset is available under a CC BY 4.0 license. SAM 2 can also be tried out in an interactive demo.
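For developers who want to try it, the open-source repository exposes predictor classes that mirror the original SAM prompting API. The following is a minimal sketch for image segmentation; the checkpoint path, config name, and click coordinates are assumptions and may differ from the released files.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Assumed checkpoint/config names; check the repository for the actual files.
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.zeros((720, 1280, 3), dtype=np.uint8)   # replace with a real HxWx3 RGB image

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # Prompt with a single foreground click, as in the original SAM API.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[640, 360]]),
        point_labels=np.array([1]),
    )
```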