Detection and Segmentation with YOLO and SAM

Detect objects with YOLO models, then use the detection output to segment those objects with SAM.

There are plenty of segmentation models out there, and in general they work similarly to each other. A typical pipeline consists of collecting data, configuring the model, and training it; after that, you can run the model on images, videos, or streams. I have already written articles about training custom Mask R-CNN instance segmentation models, but honestly, that pipeline takes time. Of course, there are other options, and one of them is SAM (Segment Anything Model).

With SAM, you only need to give the model some kind of coordinate as input, and it will do all the segmentation for you. But SAM doesn't provide labels as output, and it still needs that coordinate data as input. We will solve both problems by using a pretrained YOLO object detection model.

Detection with YOLO object detection models + Segmentation with SAM (image source)

Why Use an Object Detection Model?

Before proceeding, you might wonder: why are we using a YOLO model? The answer is simple: SAM doesn't give label output; it only does segmentation from a positional prompt. YOLO models, on the other hand, don't give only label output; they also provide bounding box coordinates, which is exactly the kind of input SAM expects.

By using an object detection model, we find the object's position and label. The position data then drives the segmentation, and in the end you have a nicely segmented and labeled object.

Segmentation with SAM

How Does SAM Actually Work?

Unlike other segmentation models, SAM needs position data as input. There are three different types of information you can give it:

  1. Single Point: one point pair (x, y)
  2. Bounding Box: bounding box coordinates (x1, y1, x2, y2)
  3. Multiple Points: positive and negative points

Because we are using the YOLO model, the bounding box approach is the best fit for our purpose. The YOLO model gives bounding box coordinates as output, and we can directly use this data for segmentation. Just check the diagram below; it clearly explains what we are trying to build.
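To make the three prompt types concrete, here is a minimal sketch using SamPredictor from the segment_anything package (model loading is covered step by step later; the coordinates below are placeholder values, not taken from a real image):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Assumes the vit_b checkpoint is already on disk (download step below)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image; the embedding is computed once per image
image_rgb = cv2.cvtColor(cv2.imread("ball.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# 1. Single point: one (x, y) pair, label 1 = foreground
masks, _, _ = predictor.predict(point_coords=np.array([[320, 240]]),
                                point_labels=np.array([1]), multimask_output=False)

# 2. Bounding box: (x1, y1, x2, y2)
masks, _, _ = predictor.predict(box=np.array([100, 80, 520, 400]),
                                multimask_output=False)

# 3. Multiple points: label 1 = include that region, label 0 = exclude it
masks, _, _ = predictor.predict(point_coords=np.array([[320, 240], [50, 50]]),
                                point_labels=np.array([1, 0]), multimask_output=False)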

Object Detection with Pretrained YOLO model

I already have an article about how to train a custom YOLO model; you can read it here (link). For now, I will use the pretrained yolov8n.pt model. If you use a pretrained model, you can run the code below as-is, and Ultralytics will download the yolov8n weights automatically.

%pip install ultralytics
from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.pt")  # pretrained YOLOv8n model

# Run batched inference on a list of images
results = model([r"ball.jpg"])  # return a list of Results objects

# Process results list
for result in results:
    boxes = result.boxes  # Boxes object for bounding box outputs
    masks = result.masks  # Masks object for segmentation masks outputs
    keypoints = result.keypoints  # Keypoints object for pose outputs
    probs = result.probs  # Probs object for classification outputs
    obb = result.obb  # Oriented boxes object for OBB outputs
    result.show()  # display to screen
    result.save(filename="result.jpg")  # save to disk
Object Detection with pretrained YOLO model
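If you want to inspect the raw numbers before wiring them into SAM, the Results object exposes the class IDs, confidences, and box corners directly. A quick sketch, reusing the results list from the code above:

# Print label, confidence, and box corners for every detection
for result in results:
    for box, conf, cls in zip(result.boxes.xyxy, result.boxes.conf, result.boxes.cls):
        x1, y1, x2, y2 = box.tolist()
        print(f"{model.names[int(cls)]}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")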

Detection with YOLO and Segmentation with SAM

First, you need to download the SAM model. You can download it from Hugging Face (link).
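If you prefer the command line, the same vit_b checkpoint is also hosted at the official URL listed in the facebookresearch/segment-anything README:

!wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth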

1. Install Libraries

%pip install git+https://github.com/facebookresearch/segment-anything.git
%pip install opencv-python ultralytics numpy torch matplotlib

2. Import Libraries

import torch
import cv2
import numpy as np
import matplotlib.pyplot as plt
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

3. Load Models

# Load YOLO model
model = YOLO("yolov8n.pt") 

# Load SAM model
sam_checkpoint = "sam_vit_b_01ec64.pth"  # Replace with your model path
model_type = "vit_b"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam) 

4. Detection + Segmentation

# Load image
image_path = r"ball.jpg"
image = cv2.imread(image_path)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # Convert to RGB for SAM

# Prepare SAM: the image embedding is computed once here, so each
# predict() call inside the loop below is cheap
predictor.set_image(image_rgb)

# Run YOLO inference
results = model(image, conf=0.3)

# Loop through detections
for result in results:
    # Get bounding boxes and class IDs
    for i, (box, cls) in enumerate(zip(result.boxes.xyxy, result.boxes.cls)):
        x1, y1, x2, y2 = map(int, box)  # Convert to integers

        # Get class ID and label
        class_id = int(cls)
        class_label = model.names[class_id]

        # Define a box prompt for SAM (XYXY format)
        input_box = np.array([x1, y1, x2, y2])

        # Get SAM mask
        masks, _, _ = predictor.predict(box=input_box, multimask_output=False)
        mask = masks[0]

        # Build an overlay that is blue inside the mask and identical to the
        # original elsewhere, so the blend only tints the segmented region
        overlay = image_rgb.copy()
        overlay[mask == 1] = [0, 0, 255]  # Blue for the segmented area (RGB)

        # Blend the overlay with the original image using transparency
        alpha = 0.7  # Transparency level for the overlay
        highlighted_image = cv2.addWeighted(image_rgb, 1 - alpha, overlay, alpha, 0)

        # Add label (class name) on top of the bounding box
        font = cv2.FONT_HERSHEY_SIMPLEX
        cv2.putText(highlighted_image, class_label, (x1, y1 - 10), font, 2, (255, 255, 0), 2, cv2.LINE_AA)

        # Save one image per detection (convert back to BGR for OpenCV)
        output_filename = f"highlighted_output_{i}.png"
        cv2.imwrite(output_filename, cv2.cvtColor(highlighted_image, cv2.COLOR_RGB2BGR))

Result: Detection with YOLO model and Segmentation with SAM model
Result 2: Detection with YOLO model and Segmentation with SAM model
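Since matplotlib is already imported, you can also preview a saved overlay inline in a notebook. A small sketch, assuming the loop above saved the first detection as highlighted_output_0.png:

# Read the saved overlay back and display it with matplotlib
img = cv2.cvtColor(cv2.imread("highlighted_output_0.png"), cv2.COLOR_BGR2RGB)
plt.imshow(img)
plt.axis("off")
plt.show()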