Combining YOLO Object Detection with Image Classification Models

→ An article about running YOLO object detection and image classification models sequentially, using PyTorch and TensorFlow.

Object Detection and Image Classification are two distinct tasks, each with a specific purpose. In short, object detection models locate and identify objects, while image classification models assign a label to an image. In this article, I will explain what Object Detection and Image Classification are and how to train the models. At the end, I will detect dogs and classify their breeds by running an object detection model and an image classification model sequentially.

YOLO Object Detection + TensorFlow Image Classification

What is Object Detection?

Object Detection is a fundamental computer vision task used to identify and localize objects. An object detection model takes an image as input and outputs bounding box coordinates and labels (person, chair, bottle, etc.), usually with a confidence score for each detection.

Object Detection with Pretrained YOLO models (image source)

If you need to know where objects are, you need an object detection model, because image classification models do not output coordinates; they only return labels.
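Detectors can report these coordinates in more than one format. YOLO-style outputs often describe a box by its center, width, and height, while drawing and cropping with OpenCV need the top-left and bottom-right corners. Here is a minimal conversion sketch, assuming pixel units and image coordinates where y grows downward. The function name `xywh_to_xyxy` is illustrative. Ultralytics `Boxes` objects already expose both `.xywh` and `.xyxy`, so this only shows what the numbers mean.

```python
def xywh_to_xyxy(cx, cy, w, h):
    """Convert a box given by its center (cx, cy), width, and height
    into corner coordinates (x1, y1, x2, y2)."""
    x1 = cx - w / 2  # left edge
    y1 = cy - h / 2  # top edge (y grows downward in images)
    x2 = cx + w / 2  # right edge
    y2 = cy + h / 2  # bottom edge
    return x1, y1, x2, y2


# A 20x10 box centered at (50, 40)
print(xywh_to_xyxy(50, 40, 20, 10))  # (40.0, 35.0, 60.0, 45.0)
```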

There are several object detection models, such as YOLO, SSD, and Faster R-CNN, and training them has become easier every year. With user-friendly libraries such as Ultralytics, you can train your own custom models.

What is Image Classification?

An image classification model gives only a label as output (usually with a score for each class), without any location. It is better suited to telling apart objects of the same kind. For example, if you want to classify species of sea animals, you need to train an image classification model. Later in this article, I will share how I trained an image classification model to classify dog breeds.

Classification of Sea Animals with Tensorflow Image Classification Models
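In practice, a classifier such as a Keras softmax model returns one score per class, and the label is simply the class with the highest score. The sketch below shows that last step in plain Python. The raw scores and the three breed names are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = ["beagle", "pug", "samoyed"]  # hypothetical subset of classes
label, p = top_label([1.0, 3.0, 0.5], labels)
print(label, round(p, 2))  # pug 0.82
```

Note that the output is a label and a probability. Nothing in it says where the dog is, which is why a detector is needed first.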

Why not only use an Object Detection Model?

You might have noticed that object detection models output both coordinates and labels, so why not use an object detection model for everything? In theory, it gives you coordinates and labels at the same time, so there would be no need for a classification model. That is a natural first thought, but there are a few factors to consider:

  • Object detection models are great for identifying and locating many kinds of objects in a scene. But when it comes to distinguishing between objects that look almost identical, image classification models usually perform better (though not always).
  • You cannot always find a suitable dataset, and creating one can be time-consuming and tedious. If you do create your own, an image classification dataset is much easier to build than an object detection dataset, because each image needs only one label instead of a bounding box drawn around every object.
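A classification dataset is commonly stored as one folder per class, and each image's label is just the name of its folder. This is the layout that Keras' `image_dataset_from_directory` reads. The sketch below collects (path, class index) pairs from such a layout. It assumes a hypothetical root such as `dogs/` with folders like `dogs/beagle/` and `dogs/pug/`. Class indices follow the sorted folder names, which matches Keras' default alphanumerical ordering.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_classification_samples(root):
    """Collect (image_path, class_index) pairs from a folder-per-class layout:

    root/
      beagle/img001.jpg
      pug/img002.jpg
    """
    root = Path(root)
    # Sorted folder names define the class order (index 0, 1, 2, ...)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            if path.suffix.lower() in IMAGE_SUFFIXES:  # ignore non-image files
                samples.append((path, index))
    return class_names, samples
```

An object detection dataset, by contrast, also needs an annotation for every object in every image, listing its class and box coordinates.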

Object Detection + Image Classification

The image below shows exactly what we are going to do:

  1. Detect objects with a pretrained YOLOv8 object detection model
  2. Classify the detected objects with an image classification model

Keep in mind that the image classification model runs only on the detected objects, not on the full image.

YOLO Object Detection + Tensorflow Image Classification
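Running the classifier only on detections means cutting each box out of the image. An OpenCV image is a NumPy array indexed as `image[y, x]`, so a crop is just a slice. Here is a minimal sketch, assuming boxes in `(x1, y1, x2, y2)` pixel format. The helper name `crop_boxes` is hypothetical. It clips boxes to the image bounds, because NumPy interprets negative indices as counting from the end of the array, and it skips crops that end up empty.

```python
import numpy as np


def crop_boxes(image, boxes):
    """Return one crop per (x1, y1, x2, y2) box, clipped to the image.

    Boxes that end up with zero width or height are skipped.
    """
    height, width = image.shape[:2]
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clip coordinates so slicing never goes outside the image
        x1, x2 = max(0, int(x1)), min(width, int(x2))
        y1, y2 = max(0, int(y1)), min(height, int(y2))
        if x2 > x1 and y2 > y1:
            crops.append(image[y1:y2, x1:x2])  # rows are y, columns are x
    return crops


image = np.zeros((100, 200, 3), dtype=np.uint8)  # height 100, width 200
crops = crop_boxes(image, [(10, 20, 60, 80), (190, 90, 250, 130), (5, 5, 5, 50)])
print([c.shape for c in crops])  # [(60, 50, 3), (10, 10, 3)]
```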

YOLO Object Detection Model for Detecting Dogs

I have already written articles about training object detection models, specifically YOLO and Faster R-CNN. You can read them for the details.

Here, I will use a pretrained YOLOv8 model directly, because it was trained on the COCO dataset, which includes a dog class. I will run detection with the YOLO model, and whenever it detects a dog, I will pass that region to the image classification model.

Keep in mind that, for a specific task, it is usually better to train a model on a dataset built for that task, because pretrained models may not include the classes you need.
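Deciding "if it detects a dog" comes down to mapping each predicted class index to a name and comparing it. Ultralytics models expose that mapping as a `names` dictionary (index to name). Their predict call also accepts `classes` and `conf` arguments that filter during inference. The sketch below does the same filtering by hand on plain tuples, with a hypothetical confidence threshold of 0.5. The code later in this article does not apply a threshold.

```python
def keep_class(detections, names, wanted="dog", min_conf=0.5):
    """Keep detections whose class name matches `wanted` and whose
    confidence is at least `min_conf`.

    detections: iterable of (class_index, confidence, (x1, y1, x2, y2))
    names: dict mapping class index -> class name
    """
    return [
        (cls_id, conf, box)
        for cls_id, conf, box in detections
        if names.get(cls_id) == wanted and conf >= min_conf
    ]


names = {0: "person", 15: "cat", 16: "dog"}  # subset of the COCO class names
detections = [
    (16, 0.91, (10, 10, 120, 140)),
    (0, 0.88, (200, 30, 260, 180)),
    (16, 0.32, (300, 50, 340, 90)),
]
print(keep_class(detections, names))  # [(16, 0.91, (10, 10, 120, 140))]
```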

Image Classification Model for Dog Breeds

I will use TensorFlow to train the image classification model. Training can take a while, depending on the dataset and parameters. You can use a pretrained image classification model, or you can follow the article below to train a custom one.

Training an Image Classification Model with Tensorflow Keras

Combine Object Detection and Image Classification Models

As explained above, the process is simple. First, the object detection model runs on the full image, and then the image classification model runs only on the detected regions. I have commented every step of the code, so I hope everything is clear.

# Libraries
import cv2
import numpy as np
from ultralytics import YOLO
from tensorflow.keras.models import load_model
import matplotlib.pyplot as plt

# Load the YOLO detection model (pretrained on COCO, which includes a "dog" class)
yolo_model = YOLO("yolov8s.pt")  # Replace with your YOLO model path

# Load the classification model: train and save it with the notebook, then load it here (check step 2)
classification_model = load_model('dog_classification_model.h5')

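# Classification labels: the order must match the class indices used in training
# (image_dataset_from_directory, for example, sorts class folders alphanumerically)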
species_list = ['afghan_hound', 'african_hunting_dog', 'airedale', 'basenji', 'basset', 'beagle', 
                'bedlington_terrier', 'bernese_mountain_dog', 'black-and-tan_coonhound', 
                'blenheim_spaniel', 'bloodhound', 'bluetick', 'border_collie', 'border_terrier', 
                'borzoi', 'boston_bull', 'bouvier_des_flandres', 'brabancon_griffon', 'bull_mastiff', 
                'cairn', 'cardigan', 'chesapeake_bay_retriever', 'chow', 'clumber', 'cocker_spaniel', 
                'collie', 'curly-coated_retriever', 'dhole', 'dingo', 'doberman', 'english_foxhound', 
                'english_setter', 'entlebucher', 'flat-coated_retriever', 'german_shepherd', 
                'german_short-haired_pointer', 'golden_retriever', 'gordon_setter', 'great_dane', 
                'great_pyrenees', 'groenendael', 'ibizan_hound', 'irish_setter', 'irish_terrier', 
                'irish_water_spaniel', 'irish_wolfhound', 'japanese_spaniel', 'keeshond', 
                'kerry_blue_terrier', 'komondor', 'kuvasz', 'labrador_retriever', 'leonberg', 
                'lhasa', 'malamute', 'malinois', 'maltese_dog', 'mexican_hairless', 'miniature_pinscher', 
                'miniature_schnauzer', 'newfoundland', 'norfolk_terrier', 'norwegian_elkhound', 
                'norwich_terrier', 'old_english_sheepdog', 'otterhound', 'papillon', 'pekinese', 
                'pembroke', 'pomeranian', 'pug', 'redbone', 'rhodesian_ridgeback', 'rottweiler', 
                'saint_bernard', 'saluki', 'samoyed', 'schipperke', 'scotch_terrier', 
                'scottish_deerhound', 'sealyham_terrier', 'shetland_sheepdog', 'standard_poodle', 
                'standard_schnauzer', 'sussex_spaniel', 'tibetan_mastiff', 'tibetan_terrier', 
                'toy_terrier', 'vizsla', 'weimaraner', 'whippet', 'wire-haired_fox_terrier', 
                'yorkshire_terrier']



"""
Function to preprocess classification input:
  Before using the classification model, the image needs to be processed. 
  Resizing, normalizing, and adding dimensions are general steps.
  Each model expects a fixed image size, and it is decided before training.
  Here, I trained my model with 180x180 images,
  which is why in the preprocess_image function, I am resizing it to 180x180. 
"""
def preprocess_image(image, target_size):
    img = cv2.resize(image, target_size)  # Resize to target size
    img = img.astype('float32') / 255.0  # Normalize pixel values
    img = np.expand_dims(img, axis=0)  # Add batch dimension
    return img

# Run the classification model on one region and return the predicted label
def classify_region(image, model, target_size=(180, 180)):  # Size must match the classification model's input
    input_image = preprocess_image(image, target_size)
    predictions = model.predict(input_image, verbose=0)  # verbose=0 hides the progress bar
    predicted_index = int(np.argmax(predictions[0]))  # Index of the highest score
    predicted_label = species_list[predicted_index]
    return predicted_label

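# Load the image
image_path = r"test-images/dog12.jpg"  # Path to your image
image = cv2.imread(image_path)
if image is None:  # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError(f"Could not read image: {image_path}")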

# YOLO inference --> Object Detection Model
results = yolo_model(image)
detections = results[0].boxes  # Get detections

# Check YOLO label for "dog" and process the bounding boxes
for detection in detections:
    x1, y1, x2, y2 = map(int, detection.xyxy[0].tolist())  # Get bbox coordinates
    conf = float(detection.conf[0])  # Get confidence
    cls_label = yolo_model.names[int(detection.cls[0])]  # Get the label name directly from YOLO

    # Check if the label is "dog"
    if cls_label == "dog":

        """
        Extract the region of interest for classification.
        Remember, the image classification model will perform 
        only on the detected objects, not on the entire image.
        """
        roi = image[y1:y2, x1:x2]

        # Classify the ROI only if it is not empty
        if roi.shape[0] > 0 and roi.shape[1] > 0:
            # Image Classification Model
            label = classify_region(roi, classification_model)

            bbox_height = y2 - y1
            font_scale = bbox_height / 200.0  # Scale factor, adjust as needed
            font_thickness = max(1, int(bbox_height / 100))  # Ensure thickness is at least 1

            # Draw the bounding box and label
            cv2.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 4)
            (text_w, text_h), _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, font_scale, font_thickness)
            text_y = max(y1 - 10, text_h + 5)  # Keep the label inside the top edge of the image
            cv2.putText(image, label, (x1, text_y), cv2.FONT_HERSHEY_SIMPLEX, font_scale, (0, 0, 255), font_thickness)
            print(f"Detected dog breed: {label} (YOLO confidence: {conf:.2f})")

cv2.imwrite("dog2-result.jpg",image)
# Display the resulting image
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis("off")
plt.show()
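The same detect-then-classify flow can be packaged as one function that receives the detector and the classifier as plain callables. This sketch shows the structure only and does not replace the code above. It assumes `detect` returns `(class name, confidence, (x1, y1, x2, y2))` tuples and that `classify` takes a crop and returns a label. In practice, you would wrap the YOLO loop and `classify_region` in small adapter functions. Passing stand-in functions, as below, makes the logic testable without loading any model weights.

```python
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]
Detection = Tuple[str, float, Box]  # (class name, confidence, box)


def detect_then_classify(
    image: np.ndarray,
    detect: Callable[[np.ndarray], List[Detection]],
    classify: Callable[[np.ndarray], str],
    target: str = "dog",
) -> List[Tuple[Box, str]]:
    """Run the detector on the full image, then the classifier on each
    detected `target` region only. Returns (box, label) pairs."""
    results = []
    for name, _conf, (x1, y1, x2, y2) in detect(image):
        if name != target:
            continue  # only the target class goes to the classifier
        roi = image[y1:y2, x1:x2]  # crop: rows are y, columns are x
        if roi.size == 0:
            continue  # skip empty regions
        results.append(((x1, y1, x2, y2), classify(roi)))
    return results


# Stand-ins for the real models, so the flow can be checked without weights
def fake_detect(img):
    return [("dog", 0.9, (0, 0, 4, 4)), ("person", 0.8, (4, 0, 8, 4))]


def fake_classify(roi):
    return "pug" if roi.mean() > 0.5 else "beagle"


img = np.zeros((8, 8), dtype=np.float32)
img[:4, :4] = 1.0  # bright top-left square where the "dog" is
print(detect_then_classify(img, fake_detect, fake_classify))  # [((0, 0, 4, 4), 'pug')]
```

With the real models, `classify` could be `lambda roi: classify_region(roi, classification_model)`.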