Run Any Deep Learning Model with ONNX Runtime in Python

→ An article about running any deep learning model with ONNX Runtime: YOLO, Faster R-CNN, ResNet, and more.

There are many libraries and frameworks for training and running deep learning models. You can run YOLO object detection models with Ultralytics, or use Keras or PyTorch for image classification models.

Using ONNX Runtime, you can run any deep learning model through a single runtime, and it supports running models on both GPU and CPU.
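Regardless of the model, the inference pattern is always the same: create an InferenceSession, look up the input name, and call run(). Here is a minimal sketch; the model path and input shape are placeholders that you would replace with the values of your own exported model:

import numpy as np
import onnxruntime as ort

# Create a session; the providers list controls where the model runs
session = ort.InferenceSession(
    "model.onnx",  # placeholder path, replace with your own exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Every ONNX model exposes named inputs and outputs
input_name = session.get_inputs()[0].name

# Dummy input; the shape and dtype depend on the model you exported
dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)

outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])

Everything that follows in this article is just this pattern plus model-specific preprocessing and postprocessing.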

In this article, I will show you how to run YOLO object detection, Faster R-CNN object detection, and ResNet image classification models using only ONNX Runtime.

Running YOLO Object Detection models on ONNX Runtime with GPU Support

Setting Up the Environment

If you want to run models on the GPU, you need an Nvidia GPU in your computer. You also need to install CUDA; after that, you can install ONNX Runtime with GPU support.

My environment:

  • Operating system: Ubuntu 22.04
  • GPU: GeForce GTX 1660 Ti
  • CUDA Version: 12.8
  • Driver Version: 570.169
CUDA installed on my system

You need to find the appropriate CUDA version for your GPU. You can follow this Nvidia documentation. Trust me, there is nothing complex here; just follow the documentation and pay attention to the versions.

If you can’t figure out which versions you need, you can even ask GPT. If you provide your GPU name, it can help you with the versions.

After you have installed CUDA on your system, it is time to install ONNX Runtime. You can't just install any version with pip, because the onnxruntime-gpu package version must match your CUDA version. Check the image below.

CUDA compatibility table from the onnxruntime documentation

I have CUDA 12.8, and I installed ONNX Runtime as shown below:

pip install "onnxruntime-gpu>=1.19"

If you don’t have a GPU, or if you want to use your CPU, you can install it directly as shown below:

pip install onnxruntime
GPU-supported ONNX Runtime env for faster inference with any deep learning model
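To make sure the GPU build actually sees your card, you can ask ONNX Runtime which execution providers it has before creating a session. A small sketch:

import onnxruntime as ort

# List the execution providers available in this install;
# a working GPU build should include 'CUDAExecutionProvider'
print(ort.get_available_providers())

If 'CUDAExecutionProvider' is not in the list, sessions will fall back to the CPU, so it is worth checking this before you start measuring FPS.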

Now it is time for coding. There will be three different sections:

  1. YOLO object detection
  2. Faster R-CNN object detection
  3. ResNet image classification

In all three cases, as the name suggests, the first step will be to export the model to the ONNX format.
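Once a model is exported, it can be worth sanity-checking the .onnx file before wiring it into an inference script. A minimal sketch using the onnx package (the file name here is just a placeholder):

import onnx

# Load and structurally validate an exported model
model = onnx.load("yolov8n.onnx")  # placeholder path
onnx.checker.check_model(model)

# Print the input and output names declared in the graph
print("inputs:", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])

This catches a broken export early and tells you which input and output names to use in your ONNX Runtime code.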

1. Run YOLO Object Detection Models using ONNX Runtime

You can find YOLO models in ONNX format directly on the internet, but I will use Ultralytics for the conversion. You can download the class list (coco-labels-paper.txt) from this link.

from ultralytics import YOLO

# pretrained YOLOv8 model
model = YOLO('yolov8n.pt')
# Export the model to ONNX format
model.export(format='onnx')

After exporting, we can load yolov8n.onnx with ONNX Runtime and run object detection on a video:

import onnxruntime as ort
import cv2
import numpy as np
import time

# Load ONNX model
session = ort.InferenceSession("models/yolov8n.onnx", providers=['CUDAExecutionProvider'])
input_name = session.get_inputs()[0].name

# Read video
video_path = "videos/street1.mp4"
cap = cv2.VideoCapture(video_path)

# get width and height
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# FPS calculation variables
fps_counter = 0
fps_start_time = time.time()
fps_display = 0.0

# Load COCO class names from file
with open('videos/coco-labels-paper.txt', 'r') as f:
    class_names = [line.strip() for line in f.readlines()]

def postprocess_yolo_output(outputs, original_shape, conf_threshold=0.3, iou_threshold=0.45):
    """Post-process YOLOv8 ONNX output"""
    predictions = outputs[0]  # Shape: [1, 84, 8400]
    predictions = predictions[0]  # Remove batch dimension: [84, 8400]
    predictions = predictions.T  # Transpose to [8400, 84]
    
    # Extract boxes and scores
    boxes = predictions[:, :4]  # First 4 columns are bbox coordinates
    scores = predictions[:, 4:]  # Remaining columns are class scores
    
    # Get the class with highest score for each detection
    class_ids = np.argmax(scores, axis=1)
    confidences = np.max(scores, axis=1)
    
    # Filter by confidence threshold
    valid_detections = confidences > conf_threshold
    boxes = boxes[valid_detections]
    confidences = confidences[valid_detections]
    class_ids = class_ids[valid_detections]
    
    # Convert from center format to corner format
    x_center, y_center, width, height = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = x_center - width / 2
    y1 = y_center - height / 2
    x2 = x_center + width / 2
    y2 = y_center + height / 2
    
    boxes = np.column_stack((x1, y1, x2, y2))
    
    # Scale boxes to original image size
    orig_h, orig_w = original_shape[:2]
    boxes[:, [0, 2]] *= orig_w / 640  # Scale x coordinates
    boxes[:, [1, 3]] *= orig_h / 640  # Scale y coordinates

    # Apply Non-Maximum Suppression to eliminate duplicate detections
    # cv2.dnn.NMSBoxes expects boxes as [x, y, width, height], so convert from corner format
    nms_boxes = [[float(bx1), float(by1), float(bx2 - bx1), float(by2 - by1)] for bx1, by1, bx2, by2 in boxes]
    indices = cv2.dnn.NMSBoxes(nms_boxes, confidences.tolist(), conf_threshold, iou_threshold)

    # Check if any boxes remain after NMS
    if len(indices) > 0:
        indices = indices.flatten()
        return boxes[indices], confidences[indices], class_ids[indices]
    else:
        return [], [], []

# Process video frames
while cap.isOpened():
    frame_start_time = time.time()
    
    ret, frame = cap.read()
    if not ret:
        break

    # YOLOv8 preprocessing
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img_resized = cv2.resize(img, (640, 640))
    img_resized = img_resized.astype(np.float32) / 255.0
    img_resized = np.transpose(img_resized, (2, 0, 1))  # HWC -> CHW
    img_resized = np.expand_dims(img_resized, axis=0)

    # Run inference
    outputs = session.run(None, {input_name: img_resized})

    # Post-process outputs
    boxes, confidences, class_ids = postprocess_yolo_output(outputs, frame.shape)

    # Draw detections
    for box, conf, cls_id in zip(boxes, confidences, class_ids):
        x1, y1, x2, y2 = map(int, box)
        
        # Draw bounding box
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        
        # Draw label
        label = f"{class_names[cls_id]}: {conf:.2f}"
        label_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)[0]
        cv2.rectangle(frame, (x1, y1 - label_size[1] - 10), 
                     (x1 + label_size[0], y1), (0, 255, 0), -1)
        cv2.putText(frame, label, (x1, y1 - 5),
                   cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

    # Add title (optional)
    title_text = "YOLOv8 Object Detection"
    title_font = cv2.FONT_HERSHEY_SIMPLEX
    title_scale = 1.0
    title_thickness = 2
    title_size = cv2.getTextSize(title_text, title_font, title_scale, title_thickness)[0]
    title_x = (width - title_size[0]) // 2
    title_y = 40
    
    # Draw background rectangle for title
    cv2.rectangle(frame, (title_x - 10, title_y - title_size[1] - 10), 
                 (title_x + title_size[0] + 10, title_y + 10), (0, 0, 0), -1)
    # Draw title text
    cv2.putText(frame, title_text, (title_x, title_y), title_font, title_scale, (255, 255, 255), title_thickness)

    # Calculate and display FPS
    fps_counter += 1
    if fps_counter % 10 == 0:  # Update FPS display every 10 frames
        fps_end_time = time.time()
        fps_display = 10 / (fps_end_time - fps_start_time)
        fps_start_time = fps_end_time
    
    # Draw FPS
    fps_text = f"FPS: {fps_display:.1f}"
    cv2.rectangle(frame, (5, height - 40), (120, height - 10), (0, 0, 0), -1)
    cv2.putText(frame, fps_text, (10, height - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

    # Display the frame
    cv2.imshow("YOLOv8 ONNX Runtime GPU", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release everything
cap.release()
cv2.destroyAllWindows()
Running YOLO Object Detection models on ONNX Runtime with GPU Support

I also have an article about training custom YOLO models; you can read it if you want to train your own model.

2. Run Faster R-CNN Object Detection Models using ONNX Runtime

First, export the pretrained Faster R-CNN model to ONNX. I will use PyTorch for the export, but of course, you can also find a ready-made ONNX model on the internet.

import torch
import torchvision

# Note: pretrained=True is deprecated in newer torchvision; use weights="DEFAULT" there instead
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

dummy = torch.randn(1, 3, 800, 800)  # typical input size

# Export the model
torch.onnx.export(model, dummy, "fasterrcnn.onnx",
                  opset_version=11,
                  input_names=["images"],
                  output_names=["boxes", "labels", "scores"])

Now load the exported fasterrcnn.onnx with ONNX Runtime and run it on a video. This script also prints which execution providers are available and which one the session actually uses:

import onnxruntime as ort
import cv2
import numpy as np
import time

# Check available providers first
print("Available ONNX Runtime providers:", ort.get_available_providers())
print("CUDA provider available:", 'CUDAExecutionProvider' in ort.get_available_providers())

# Load pretrained Faster R-CNN model
session = ort.InferenceSession("models/fasterrcnn.onnx", providers=['CUDAExecutionProvider'])

# Verify which providers are actually being used
print("Session providers:", session.get_providers())
print("Using GPU:", 'CUDAExecutionProvider' in session.get_providers())

# Check input/output details
input_details = session.get_inputs()[0]
print(f"\nInput name: {input_details.name}")
print(f"Input shape: {input_details.shape}")
print(f"Input type: {input_details.type}")

for i, output in enumerate(session.get_outputs()):
    print(f"Output {i}: {output.name}, shape: {output.shape}")

# Load COCO class names from file (includes __background__ class for Faster R-CNN)
with open('videos/coco-labels-fasterrcnn.txt', 'r') as f:
    coco_classes = [line.strip() for line in f.readlines()]

def preprocess_frame(frame, target_size=(800, 800)):
    """Preprocess frame for Faster R-CNN"""
    # Convert BGR to RGB
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    
    # Resize image
    resized = cv2.resize(rgb_frame, target_size)
    
    # Normalize to [0, 1]
    normalized = resized.astype(np.float32) / 255.0
    
    # Convert HWC to CHW
    chw = np.transpose(normalized, (2, 0, 1))
    
    # Add batch dimension
    batch = np.expand_dims(chw, axis=0)
    
    return batch

def postprocess_detections(outputs, original_shape, conf_threshold=0.5):
    """Post-process Faster R-CNN outputs"""
    # Faster R-CNN typically outputs: boxes, labels, scores
    # The exact order depends on how the model was exported
    if len(outputs) == 3:
        boxes, labels, scores = outputs
    else:
        # If single output, it might be a dictionary-like structure
        # We'll need to adapt based on actual output format
        print(f"Unexpected number of outputs: {len(outputs)}")
        return [], [], []
    
    # Remove batch dimension if present
    if len(boxes.shape) == 3:
        boxes = boxes[0]
    if len(labels.shape) == 2:
        labels = labels[0]
    if len(scores.shape) == 2:
        scores = scores[0]
    
    # Filter by confidence threshold
    valid_indices = scores > conf_threshold
    filtered_boxes = boxes[valid_indices]
    filtered_labels = labels[valid_indices]
    filtered_scores = scores[valid_indices]
    
    # Scale boxes to original image size
    orig_h, orig_w = original_shape[:2]
    if len(filtered_boxes) > 0:
        # Boxes are in [x1, y1, x2, y2] format, in pixels of the 800x800 network input
        filtered_boxes[:, [0, 2]] *= orig_w / 800  # Scale x coordinates
        filtered_boxes[:, [1, 3]] *= orig_h / 800  # Scale y coordinates
    
    return filtered_boxes, filtered_labels, filtered_scores

# Open video
video_path = "videos/street1.mp4"
cap = cv2.VideoCapture(video_path)

# FPS calculation variables
fps_counter = 0
fps_start_time = time.time()
fps_display = 0.0

# Determine device string for display
device_str = "GPU" if 'CUDAExecutionProvider' in session.get_providers() else "CPU"
print(f"\nRunning Faster R-CNN with ONNX Runtime on {device_str}...")
print("Press 'q' to quit the video display")

# Loop through video frames
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Preprocess frame 
    input_tensor = preprocess_frame(frame)

    # Run inference 
    try:
        outputs = session.run(None, {input_details.name: input_tensor})
        
        # Post-process 
        boxes, labels, scores = postprocess_detections(outputs, frame.shape)
        
        # Draw detections 
        for box, label, score in zip(boxes, labels, scores):
            if len(box) >= 4:
                x1, y1, x2, y2 = map(int, box[:4])
                
                # Draw bounding box
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                
                # Draw label
                class_name = coco_classes[int(label)] if int(label) < len(coco_classes) else f"class_{int(label)}"
                label_text = f"{class_name}: {score:.2f}"
                label_size = cv2.getTextSize(label_text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)[0]
                cv2.rectangle(frame, (x1, y1 - label_size[1] - 10), 
                             (x1 + label_size[0], y1), (0, 255, 0), -1)
                cv2.putText(frame, label_text, (x1, y1 - 5),
                           cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)
    
    except Exception as e:
        print(f"Inference error: {e}")
        # Continue with next frame even if inference fails

    # Calculate and display FPS
    fps_counter += 1
    if fps_counter % 10 == 0:  # Update FPS display every 10 frames
        fps_end_time = time.time()
        fps_display = 10 / (fps_end_time - fps_start_time)
        fps_start_time = fps_end_time
    
    # Draw FPS with device info
    fps_text = f"FPS: {fps_display:.1f} (Faster R-CNN-{device_str})"
    cv2.rectangle(frame, (5, 5), (250, 35), (0, 0, 0), -1)
    cv2.putText(frame, fps_text, (10, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    # Display 
    cv2.imshow("Faster R-CNN ONNX Runtime", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

print("Video processing completed!")
Running Faster R-CNN Object Detection models on ONNX Runtime with GPU Support

I also have an article about training custom Faster R-CNN models; you can read it for more details.

3. Run ResNet Image Classification Models using ONNX Runtime

There are different ResNet image classification models. You can download the models from this link. You can also download the label list as shown below:

!wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json

import onnxruntime as ort
import numpy as np
import cv2
import json

# Load ONNX model
session = ort.InferenceSession("models/resnet50_Opset17.onnx", providers=['CUDAExecutionProvider'])

# Load ImageNet class labels
with open("imagenet_class_index.json") as f:
    class_idx = json.load(f)

# Preprocess image
img = cv2.imread("videos/cat.jpg")  # Your image path
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (224, 224))
img = img.astype(np.float32) / 255.0

# Normalize using ImageNet mean/std
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std  = np.array([0.229, 0.224, 0.225], dtype=np.float32)
img = (img - mean) / std

# HWC -> CHW
img = np.transpose(img, (2, 0, 1))
img = np.expand_dims(img, 0)  # Add batch dimension

# Run inference 
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: img})

# Postprocess 
logits = outputs[0][0]  # [1000]
# Apply softmax to get probabilities (subtract the max logit for numerical stability)
probs = np.exp(logits - np.max(logits)) / np.sum(np.exp(logits - np.max(logits)))

# Get top-1 class
class_id = np.argmax(probs)
confidence = probs[class_id]
label = class_idx[str(class_id)][1]  # e.g., "golden_retriever"

print(f"Predicted class ID: {class_id}")
print(f"Label: {label}")
print(f"Probability: {confidence:.4f}")

ResNet Image Classification Model on ONNX Runtime
Output of ResNet Image Classification Model
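
If you also want the top-5 predictions instead of only the best class, you can sort the same probs array from the script above; a quick sketch:

# Top-5 predictions, reusing `probs` and `class_idx` from the script above
top5 = np.argsort(probs)[::-1][:5]
for i in top5:
    print(f"{class_idx[str(i)][1]}: {probs[i]:.4f}")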