Introduction to SAM2: Installation Guide and Performing Segmentation on Video

→ Article about how to segment objects using the pretrained promptable SAM2 segmentation models on videos.

Segment Anything (SAM) is a promptable segmentation model that works with images: you give it position information as input, and the model does the segmentation. You don’t have to train a model; you only have to provide the positional input. This input can be positive points (points within the object), rectangle coordinates, or positive and negative points combined.

You can think of SAM2 as an advanced version of SAM; it works on videos as well. This article will be a quick introduction to SAM2; I will show you how to create an environment for using SAM2 models, and how to perform segmentation on videos using the pretrained SAM2 promptable segmentation models.

Segmentation of an object in a video using pretrained promptable SAM2 segmentation models (video)

I also have a YouTube video that covers this article, if you prefer to watch it.

This will be a quick introduction to SAM2, but I am planning to write more articles about SAM and SAM2. I already have an article about SAM; you can check that as well.

Installation Guide

You can follow the GitHub repository (link) of SAM2, but here I will only show you the necessary steps.

It is highly recommended to create a GPU-supported PyTorch environment for experimenting with SAM2 models, but it is not mandatory. However, I must warn you: without a GPU-supported PyTorch environment, it might take more than 10 hours to process even a 10-second video. So, it is up to you 🙂 The minimum versions you need are:

  • python>=3.10
  • torch>=2.5.1
  • torchvision>=0.20.1
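
If you want to confirm that your environment meets these minimums before going further, a small check like the one below can help. It is only a sketch: the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are made up for this article, and the version parsing is deliberately simple (it ignores suffixes such as `+cu121`).

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the list above
MINIMUMS = {"torch": "2.5.1", "torchvision": "0.20.1"}


def version_tuple(version):
    """Turn '2.5.1+cu121' into (2, 5, 1); build suffixes are ignored."""
    numbers = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Return a dict like {'python': True, 'torch': False, ...}."""
    report = {"python": sys.version_info >= (3, 10)}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = False  # package is not installed at all
    return report


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```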

I have a step-by-step installation guide for a GPU-supported PyTorch environment (both miniconda and Python virtual environment); you can read it. You can also use Kaggle or Google Colab servers for GPU support.

Anyway, I created a miniconda environment named sam2, and you can see that the GPU is available.

GPU-supported (CUDA) PyTorch environment
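
To run the same check yourself, something like this should work inside the environment. It is a minimal sketch; `describe_device` is a name I made up, and it simply reports whether PyTorch can see a CUDA device.

```python
def describe_device():
    """Report which device PyTorch would use, without failing if torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu only"


print(describe_device())
```

If this prints "cpu only", SAM2 will still run, just much more slowly.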

Okay, the next step is to clone the SAM2 repository, and install the other necessary libraries.

git clone https://github.com/facebookresearch/sam2.git 
cd sam2
pip install -e ".[notebooks]"

There might be some warnings like "Failed to build the SAM 2 CUDA extension" during installation, but that is not a problem. You can still continue.

Installation is finished; now it is time to download the pretrained SAM2 models.

Segmentation of a cat using SAM2 segmentation model

Download Pretrained SAM2 Models

As you can see in the image below, there are multiple pretrained models. I have an old GPU (maybe older than some folks here 🙂), so I will stick with the tiny model. Even for a 10-second video, processing took about 20 minutes. You can download the models from the GitHub repository (link) of SAM2, or you can run the code in the image below, which downloads all the pretrained SAM2 segmentation models.
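
If you prefer to trigger the download from Python, a rough sketch is below. It assumes the cloned repository ships a `download_ckpts.sh` script inside its `checkpoints` folder (check your copy of the repository to confirm); `download_checkpoints` itself is a hypothetical helper, not part of SAM2.

```python
import subprocess
from pathlib import Path


def download_checkpoints(repo_dir="."):
    """Run the repository's download script from inside its checkpoints folder."""
    checkpoints_dir = Path(repo_dir) / "checkpoints"
    script = checkpoints_dir / "download_ckpts.sh"  # assumed script name
    if not script.is_file():
        raise FileNotFoundError(f"{script} not found; is repo_dir the cloned sam2 folder?")
    # Run with checkpoints/ as the working directory so the files land there
    subprocess.run(["bash", script.name], cwd=checkpoints_dir, check=True)
    # List whatever .pt files are now present
    return sorted(path.name for path in checkpoints_dir.glob("*.pt"))


# Usage, from the folder that contains the cloned repository:
# print(download_checkpoints("sam2"))
```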

Pretrained SAM2 segmentation models

Perform Segmentation on Video

Inside the SAM2 repository, there are demo applications. Now I will use the video_predictor_example.ipynb notebook to perform segmentation on videos, and you can find this notebook here:

  • sam2/notebooks/video_predictor_example.ipynb

I am not going to copy and paste all the code here; you can follow the notebook directly (GitHub link). I will only show you the key parts, plus one additional part for saving the video.

First, we need to load the pretrained SAM2 model that we already downloaded. If you have a different model, you need to change the sam2_checkpoint and model_cfg variables.

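import torch
from sam2.build_sam import build_sam2_video_predictor

# use the GPU when it is available, otherwise fall back to the (much slower) CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sam2_checkpoint = "../checkpoints/sam2.1_hiera_tiny.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"

predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)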

This notebook expects each frame of the video to be saved in a folder. You can directly use the videos and frames from the sam2/notebooks/videos folder, or if you have a custom video, you need to save its frames to a folder one by one. You can use OpenCV or ffmpeg; it’s up to you. I did it with ffmpeg, and if you decide to use OpenCV, it will take around 10 lines of code :). I want to use a custom video (a video of a super cute cat), so I saved its frames one by one to a folder.

!ffmpeg -i sam2/videos/cat.mp4 -q:v 2 -start_number 0 sam2/videos/frames/%05d.jpg

Segmentation on images with SAM2
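
For the OpenCV route mentioned above, a minimal sketch could look like this. The function names are made up for illustration; the file naming mirrors the ffmpeg command (`%05d.jpg`, starting from 0) so the notebook can read the frames the same way.

```python
import os


def frame_filename(index):
    """Zero-padded name matching ffmpeg's %05d.jpg pattern, e.g. 00000.jpg."""
    return f"{index:05d}.jpg"


def extract_frames(video_path, out_dir):
    """Save every frame of the video as a JPEG and return how many were written."""
    import cv2  # imported here so frame_filename works without OpenCV installed

    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    if not capture.isOpened():
        raise IOError(f"could not open {video_path}")
    count = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # no more frames
            break
        cv2.imwrite(os.path.join(out_dir, frame_filename(count)), frame)
        count += 1
    capture.release()
    return count


# Usage with the same paths as the ffmpeg command:
# extract_frames("sam2/videos/cat.mp4", "sam2/videos/frames")
```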

Now we will give the positional input to the SAM2 model on the first frame: a single (x, y) point within the target object’s boundaries. It works better if you pick a point near the center of your target object (the cat in my case). Change the points variable based on the position of the target object in your own video.
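
The code that follows relies on `video_dir`, `frame_names`, and `inference_state`, which the notebook sets up in earlier cells. A minimal sketch of that setup is below; `list_frame_names` is my own helper name, and the folder path is the one from the ffmpeg command, so adjust it to your layout.

```python
import os


def list_frame_names(video_dir):
    """Return the JPEG frame names sorted by frame number (00000.jpg, 00001.jpg, ...)."""
    names = [
        name for name in os.listdir(video_dir)
        if os.path.splitext(name)[-1].lower() in (".jpg", ".jpeg")
    ]
    # Sort numerically on the file name without its extension
    names.sort(key=lambda name: int(os.path.splitext(name)[0]))
    return names


# Usage in the notebook, where `predictor` is the video predictor built earlier:
# video_dir = "sam2/videos/frames"
# frame_names = list_frame_names(video_dir)
# inference_state = predictor.init_state(video_path=video_dir)
```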

ann_frame_idx = 0  # first frame
ann_obj_id = 1  # give a unique id to each object

# a single (x, y) point at (210, 350) to get started
points = np.array([[210, 350]], dtype=np.float32)
# for labels, `1` means a positive click (within the object boundaries) and `0` means a negative click (outside the object boundaries)
labels = np.array([1], np.int32)
_, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_id=ann_obj_id,
    points=points,
    labels=labels,
)

# show the results (show_points and show_mask are helper functions defined earlier in the notebook)
plt.figure(figsize=(9, 6))
plt.title(f"frame {ann_frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
show_points(points, labels, plt.gca())
show_mask((out_mask_logits[0] > 0.0).cpu().numpy(), plt.gca(), obj_id=out_obj_ids[0])

Segmentation of a cat using SAM2 segmentation model

You can add more points to refine the mask; the video_predictor_example.ipynb notebook shows how.
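
As a rough illustration of mixing positive and negative clicks, the sketch below collects all clicks for one frame and passes them together to the same `add_new_points_or_box` call used above. The helper names and the second click's coordinates are hypothetical; pick coordinates that make sense for your video.

```python
def split_clicks(clicks):
    """Split (x, y, is_positive) clicks into the points and labels lists SAM2 expects."""
    points = [[float(x), float(y)] for x, y, _ in clicks]
    labels = [1 if is_positive else 0 for _, _, is_positive in clicks]
    return points, labels


def add_clicks(predictor, inference_state, clicks, frame_idx=0, obj_id=1):
    """Send every click for one object on one frame in a single call."""
    import numpy as np  # imported here so split_clicks has no dependencies

    points, labels = split_clicks(clicks)
    return predictor.add_new_points_or_box(
        inference_state=inference_state,
        frame_idx=frame_idx,
        obj_id=obj_id,
        points=np.array(points, dtype=np.float32),
        labels=np.array(labels, dtype=np.int32),
    )


# One positive click on the cat and one negative click on the background (made-up point):
# clicks = [(210, 350, True), (60, 40, False)]
# _, out_obj_ids, out_mask_logits = add_clicks(predictor, inference_state, clicks)
```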

Okay, now it is time for processing all the frames. This step might take a really long time depending on your GPU, video resolution, and length.

# run propagation throughout the video and collect the results in a dict
video_segments = {}  # video_segments contains the per-frame segmentation results
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
        for i, out_obj_id in enumerate(out_obj_ids)
    }

# render the segmentation results every 30 frames, you can change this
vis_frame_stride = 30
plt.close("all")
for out_frame_idx in range(0, len(frame_names), vis_frame_stride):
    plt.figure(figsize=(6, 4))
    plt.title(f"frame {out_frame_idx}")
    plt.imshow(Image.open(os.path.join(video_dir, frame_names[out_frame_idx])))
    for out_obj_id, out_mask in video_segments[out_frame_idx].items():
        show_mask(out_mask, plt.gca(), obj_id=out_obj_id)

Segmentation of a cat using SAM2 segmentation model
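
A short note on the `> 0.0` comparison used in the code above: the model returns logits, not probabilities, and a logit of 0 corresponds to a probability of 0.5 under the sigmoid function. So thresholding at 0.0 keeps the pixels the model considers more likely object than background. The tiny sketch below (plain Python, helper names are mine) shows the relationship.

```python
import math


def logit_to_probability(logit):
    """Sigmoid: map a raw logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def is_foreground(logit, threshold=0.0):
    """Same rule as `out_mask_logits > 0.0`, applied to a single value."""
    return logit > threshold


for value in (-2.0, 0.0, 2.0):
    print(value, round(logit_to_probability(value), 3), is_foreground(value))
```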

If you want to save all the segmented frames and create a video, you can use this code block. It is not part of the notebook; it is an optional step.

import os

import cv2
import numpy as np
from PIL import Image

# Define video writer; PIL's .size is (width, height)
w, h = Image.open(os.path.join(video_dir, frame_names[0])).size
fps = 25  # set this to the frame rate of your source video
out = cv2.VideoWriter("segmented_output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

# Keep one color per object id so the overlay does not change color between frames
rng = np.random.default_rng(0)
colors = {}

# Loop through all frames in order
for out_frame_idx in range(len(frame_names)):
    # Load original frame
    frame_path = os.path.join(video_dir, frame_names[out_frame_idx])
    frame = np.array(Image.open(frame_path).convert("RGB"))

    # Convert to BGR for OpenCV
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Overlay the masks if we have segmentation results for this frame
    for out_obj_id, out_mask in video_segments.get(out_frame_idx, {}).items():
        # Masks are stored with shape (1, H, W); drop the leading axis
        mask = np.squeeze(out_mask).astype(bool)

        # Pick a random color the first time we see this object
        if out_obj_id not in colors:
            colors[out_obj_id] = rng.integers(0, 256, size=3, dtype=np.uint8)

        # Paint the object's color only where the mask is set
        colored_mask = np.zeros_like(frame)
        colored_mask[mask] = colors[out_obj_id]

        # Blend with original frame (the overlay is added at 50% strength)
        frame = cv2.addWeighted(frame, 1.0, colored_mask, 0.5, 0)

    # Write frame to video
    out.write(frame)

out.release()

Segmentation on video using SAM2 models

Okay, that was a quick introduction to SAM2; you can check the other notebooks in the repository to learn more. That’s it from me, see you soon.