Meta releases the third generation of its "Segment Anything Model." Unlike standard models limited to fixed categories, SAM 3 uses an open vocabulary to understand both images and videos. The system relies on a new training method combining human and AI annotators.
Users can provide text prompts, example images ("exemplar prompts"), or visual prompts to isolate specific concepts in images and videos. Alongside the model weights and code, Meta has launched the Segment Anything Playground, a web interface where users can test the model.
Bridging the gap between language and vision
According to Meta, connecting language with visual elements remains a major hurdle in computer vision. Standard models recognize basic objects like "person" but struggle with nuanced descriptions like "the striped red umbrella" because they rely on predefined categories.
SAM 3 addresses this with "Promptable Concept Segmentation." The model takes short noun phrases or reference images and finds every instance of the concept in an image or video. It still supports the visual prompts known from SAM 1 and SAM 2, such as masks, bounding boxes, and points.
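To make the prompting model concrete, the sketch below shows the three prompt types and the shape of a concept-segmentation call. It is a minimal, hypothetical illustration; the class and function names are invented for this example and do not reflect Meta's actual SAM 3 API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The three prompt types the article describes: a short noun phrase,
# an exemplar region from a reference image, or classic visual prompts
# (points, boxes) carried over from SAM 1/2.
@dataclass
class TextPrompt:
    noun_phrase: str                        # e.g. "striped red umbrella"

@dataclass
class ExemplarPrompt:
    reference_image: object                 # stand-in for image data
    box: Tuple[int, int, int, int]          # region containing the example object

@dataclass
class VisualPrompt:
    points: Optional[List[Tuple[int, int]]] = None
    box: Optional[Tuple[int, int, int, int]] = None

def segment_concept(image, prompt) -> List[dict]:
    """Placeholder standing in for a SAM 3 forward pass: given one prompt,
    return a mask for *every* instance of the concept in the image."""
    # A real model would return per-instance masks and scores; this stub
    # only illustrates the call shape.
    return [{"mask": None, "score": 1.0, "prompt": prompt}]

if __name__ == "__main__":
    image = object()  # stand-in for a loaded image
    for p in (TextPrompt("striped red umbrella"),
              VisualPrompt(points=[(120, 340)])):
        instances = segment_concept(image, p)
        print(type(p).__name__, "->", len(instances), "instance(s)")
```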

In internal benchmarks, specifically the new "Segment Anything with Concepts" (SA-Co) test, SAM 3 reportedly doubles the performance of existing systems. Meta claims the model outperforms specialized tools like GLEE and OWLv2, as well as large multimodal models like Gemini 2.5 Pro.

A hybrid approach speeds up training
Meta built a hybrid "data engine" for training—a pipeline where AI models, including SAM 3 and Llama-based captioning, generate initial segmentation masks. Human and AI annotators then verify and correct these suggestions.
According to the research paper, this process significantly speeds up annotation. AI assistance is roughly five times faster than manual effort for negative prompts (object not present) and 36 percent more efficient for positive ones. This resulted in a training dataset containing over four million unique concepts.
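As a rough illustration of how such a propose, verify, and route loop could work, here is a toy sketch. The thresholds, function names, and routing logic are assumptions made for illustration, not details taken from Meta's paper.

```python
import random

def propose_masks(image):
    """Stand-in for AI models (SAM 3 plus a Llama-based captioner)
    proposing candidate concept masks for an image."""
    return [{"concept": c, "mask": None} for c in ("dog", "leash", "bench")]

def ai_verifier_score(candidate) -> float:
    """Stand-in for an AI annotator that rates mask quality (0..1)."""
    return random.random()

def annotate(images, accept_threshold=0.9):
    """Hybrid loop: high-confidence proposals are accepted automatically,
    the rest are routed to human annotators."""
    accepted, needs_human = [], []
    for img in images:
        for cand in propose_masks(img):
            if ai_verifier_score(cand) >= accept_threshold:
                accepted.append(cand)        # AI-verified, no human needed
            else:
                needs_human.append(cand)     # routed to a human annotator
    return accepted, needs_human

if __name__ == "__main__":
    ok, review = annotate(images=[object()] * 5)
    print(f"auto-accepted: {len(ok)}, sent to humans: {len(review)}")
```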

Meta is already integrating the technology into its products. On Facebook Marketplace, SAM 3 powers the "View in Room" feature, letting users virtually place furniture in their homes. In Instagram's "Edits" app, SAM 3 will soon enable effects that creators can apply to specific people or objects.
Running on an Nvidia H200 GPU, SAM 3 processes an image with over 100 recognized objects in 30 milliseconds. For video, latency increases with the number of tracked objects, but the model maintains near real-time performance with about five simultaneous objects.
Meta notes several limitations. SAM 3 struggles to generalize zero-shot to highly specific technical terms outside its training data, such as those in medical imaging. The model also fails with complex logical descriptions like "the second to last book from the right on the top shelf." To address this, Meta suggests pairing SAM 3 with multimodal language models such as Llama or Gemini, a combination it calls the "SAM 3 Agent."
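The agent pattern can be sketched as a simple decompose, segment, and select loop: a multimodal LLM rewrites the complex description into short noun phrases SAM 3 can handle, then checks which returned masks actually satisfy the original request. The functions below are stand-ins written for this article, not the actual SAM 3 Agent implementation.

```python
from typing import List

def llm_propose_queries(description: str) -> List[str]:
    """Stand-in for an MLLM (e.g. Llama or Gemini) rewriting a complex
    description into short noun phrases."""
    return ["book", "top shelf"]

def sam3_segment(image, noun_phrase: str) -> List[dict]:
    """Stand-in for SAM 3 returning all instances of a simple concept."""
    return [{"phrase": noun_phrase, "mask": None}]

def llm_select(description: str, candidates: List[dict]) -> List[dict]:
    """Stand-in for the MLLM checking which candidate masks satisfy the
    original description (e.g. 'second to last from the right')."""
    return candidates[:1]

def sam3_agent(image, description: str) -> List[dict]:
    """Decompose the request, segment each simple concept, then filter."""
    candidates = []
    for phrase in llm_propose_queries(description):
        candidates.extend(sam3_segment(image, phrase))
    return llm_select(description, candidates)

if __name__ == "__main__":
    result = sam3_agent(object(),
                        "the second to last book from the right on the top shelf")
    print(len(result), "mask(s) selected")
```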
Reconstructing 3D worlds from 2D images
Alongside SAM 3, Meta released SAM 3D, a suite of two models designed to generate 3D reconstructions from single 2D images.
SAM 3D Objects focuses on reconstructing objects and scenes. Since 3D training data is scarce compared to 2D images, Meta applied its "data engine" principle here as well. Annotators rate multiple AI-generated mesh options, while the hardest examples are routed to expert 3D artists. This method allowed Meta to annotate nearly one million images with 3D information, creating a system that turns photos into manipulable 3D objects.
The second model, SAM 3D Body, specializes in capturing human poses and shapes. It uses the new "Meta Momentum Human Rig" (MHR) format, which separates skeletal structure from soft tissue shape. Trained on approximately eight million images, the model works robustly even with occlusions or unusual postures.
Meta says the technology is still in its early stages. The resolution of generated 3D objects remains limited, leading to detail loss in complex structures. Additionally, SAM 3D Objects treats items in isolation and cannot yet correctly simulate physical interactions between multiple objects. Meta also notes that SAM 3D Body has not yet reached the precision of specialized hand-tracking tools.