Welcome readers, the CV class is back in session! We’ve already studied 30+ computer vision models in my previous blogs, each bringing its own unique strengths to the table, from the rapid detection skills of YOLO to the transformative power of Vision Transformers (ViTs). Today, we’re introducing a new student to our classroom: RF-DETR. Read on to learn everything about Roboflow’s RF-DETR and how it bridges the gap between speed and accuracy in object detection.
RF-DETR is a real-time, transformer-based object detection model that achieves over 60 mAP on the COCO dataset, an impressive accomplishment. Naturally, we’re curious: will RF-DETR be able to match YOLO’s speed? Can it adapt to the diverse tasks we encounter in the real world?
That’s what we’re here to explore. In this article, we’ll break down RF-DETR’s core features (real-time capability, strong domain adaptability, and open-source availability) and see how it performs alongside other models. Let’s dive in and see if this newcomer has what it takes to excel in real-world applications!
Object detection models are increasingly challenged to prove their worth beyond just COCO – a dataset that, while historically critical, hasn’t been updated since 2017. As a result, many models show only marginal improvements on COCO and turn to other datasets (e.g., LVIS, Objects365) to demonstrate generalizability.
RF100-VL is Roboflow’s new benchmark that collects around 100 diverse datasets (aerial imagery, industrial inspection, etc.) drawn from the 500,000+ datasets on Roboflow Universe. The benchmark emphasizes domain adaptability, a critical factor for real-world use cases where data can look drastically different from COCO’s common objects.
In the above table, we can see how RF-DETR stacks up against other real-time object detection models:
Note: As of now, code and checkpoints for RF-DETR-base and RF-DETR-large are available.
In this chart, RF-DETR demonstrates accuracy competitive with YOLO models while keeping latency in the same range. RF-DETR surpasses the 60 mAP threshold, making it the first documented real-time model to reach that performance level on COCO.
Here, RF-DETR stands out by achieving the highest mAP on RF100-VL, indicating strong adaptability across varied domains. This suggests that RF-DETR is not only competitive on COCO but also excels at handling real-world datasets whose domain-specific objects and conditions differ significantly from COCO’s common objects.
Based on the performance metrics from the Roboflow leaderboard, RF-DETR demonstrates competitive results in both accuracy and efficiency.
This ranking further highlights RF-DETR’s efficiency, delivering high performance with optimized latency while maintaining a smaller model size compared to some competitors.
Historically, CNN-based YOLO models have led the pack in real-time object detection. Yet, CNNs alone do not always benefit from large-scale pre-training, which is increasingly pivotal in machine learning.
Transformers excel with large-scale pre-training but have often been too heavy or slow for real-time applications. Recent work, however, shows that DETR-based models can match YOLO’s speed once the post-processing overhead YOLO requires is taken into account.
For more information, read this research paper.
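To make that post-processing point concrete: a YOLO-style detector emits many overlapping candidate boxes that must be filtered with non-maximum suppression (NMS) at inference time, and that filtering adds latency that DETR’s set-based predictions avoid. Below is a minimal sketch of the NMS step, using torchvision and made-up example boxes:

import torch
from torchvision.ops import nms

# Three candidate boxes in (x1, y1, x2, y2) format; the second is a
# near-duplicate of the first, as often happens with anchor-based heads
boxes = torch.tensor([
    [10.0, 10.0, 110.0, 110.0],
    [12.0, 12.0, 112.0, 112.0],
    [200.0, 200.0, 300.0, 300.0],
])
scores = torch.tensor([0.90, 0.85, 0.75])

# Suppress any box whose IoU with a higher-scoring box exceeds 0.5
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the near-duplicate box is removed

DETR-family models are trained to output one box per object directly, so this extra filtering pass (and its latency) disappears from the inference pipeline.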
Install RF-DETR via:
pip install rfdetr
You can then load a pre-trained checkpoint (trained on COCO) for immediate use in your application:
import io
import requests
import supervision as sv
from PIL import Image
from rfdetr import RFDETRBase

# Load the COCO-pretrained base model
model = RFDETRBase()

# Fetch a sample image
url = "https://media.roboflow.com/notebooks/examples/dog-2.jpeg"
image = Image.open(io.BytesIO(requests.get(url).content))

# Run inference, keeping detections with confidence >= 0.5
detections = model.predict(image, threshold=0.5)

# Draw boxes and labels with supervision, then display the result
annotated_image = image.copy()
annotated_image = sv.BoxAnnotator().annotate(annotated_image, detections)
annotated_image = sv.LabelAnnotator().annotate(annotated_image, detections)
sv.plot_image(annotated_image)
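The detections object exposes parallel arrays (xyxy, class_id, confidence). To display class names rather than raw IDs, the rfdetr README pairs these with the package’s bundled COCO class mapping:

from rfdetr.util.coco_classes import COCO_CLASSES

# Build "name score" strings, one per detection
labels = [
    f"{COCO_CLASSES[class_id]} {confidence:.2f}"
    for class_id, confidence
    in zip(detections.class_id, detections.confidence)
]

annotated_image = sv.LabelAnnotator().annotate(annotated_image, detections, labels)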
I’m providing my GitHub repository link so you can implement the model yourself 🙂. Just follow the README.md instructions to run the code.
Code:
import cv2
import json
from rfdetr import RFDETRBase

# Load the model
model = RFDETRBase()

# Read the classes.json file and store class names in a dictionary
with open('classes.json', 'r', encoding='utf-8') as file:
    class_names = json.load(file)

# Open the video file
cap = cv2.VideoCapture('walking.mp4')  # https://www.pexels.com/video/video-of-people-walking-855564/

# For live video streaming instead:
# cap = cv2.VideoCapture(0)  # 0 refers to the default camera

# Create the output video (mp4v is the usual FourCC for .mp4 files)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output.mp4', fourcc, 20.0, (960, 540))

while True:
    # Read a frame
    ret, frame = cap.read()
    if not ret:
        break  # Exit the loop when the video ends

    # Perform object detection (OpenCV frames are BGR; the model expects RGB)
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    detections = model.predict(rgb_frame, threshold=0.5)

    # Mark the detected objects
    for i, box in enumerate(detections.xyxy):
        x1, y1, x2, y2 = map(int, box)
        class_id = int(detections.class_id[i])

        # Get the class name using class_id
        label = class_names.get(str(class_id), "Unknown")
        confidence = detections.confidence[i]

        # Draw the bounding box (white and thick)
        color = (255, 255, 255)
        thickness = 7
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, thickness)

        # Display the label and confidence score in a readable font
        text = f"{label} ({confidence:.2f})"
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 2
        font_thickness = 7
        text_x = x1
        text_y = y1 - 10
        cv2.putText(frame, text, (text_x, text_y), font, font_scale,
                    (0, 0, 255), font_thickness, cv2.LINE_AA)

    # Display the results
    resized_frame = cv2.resize(frame, (960, 540))
    cv2.imshow('Labeled Video', resized_frame)

    # Save the output
    out.write(resized_frame)

    # Exit when 'q' key is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release resources
cap.release()
out.release()
cv2.destroyAllWindows()
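One detail worth noting: the script expects a classes.json file that maps stringified class IDs to names. If you are running the COCO-pretrained checkpoint, you can generate that file from the rfdetr package’s own COCO mapping. A sketch, assuming COCO_CLASSES is the ID-to-name dictionary used in the README example earlier:

import json
from rfdetr.util.coco_classes import COCO_CLASSES

# Write {"1": "person", "2": "bicycle", ...} so the video script can
# look up names by stringified class ID
with open('classes.json', 'w', encoding='utf-8') as file:
    json.dump({str(class_id): name for class_id, name in COCO_CLASSES.items()}, file, indent=2)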
Output:
Fine-tuning is where RF-DETR really shines, especially if you’re working with niche or smaller datasets:
from rfdetr import RFDETRBase

model = RFDETRBase()

model.train(
    dataset_dir="<DATASET_PATH>",
    epochs=10,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4
)
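After training, you can load the fine-tuned weights back into the model for inference. A minimal sketch, using the pretrain_weights argument from the rfdetr README (the checkpoint path is a placeholder; point it at the file your run actually produced, e.g. the EMA weights):

from rfdetr import RFDETRBase

# Load fine-tuned weights instead of the default COCO checkpoint
model = RFDETRBase(pretrain_weights="<CHECKPOINT_PATH>")

detections = model.predict(image, threshold=0.5)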
During training, RF-DETR will produce two checkpoints: the regular model weights and an EMA (exponential moving average) version of the weights, which tends to be more stable and often generalizes better.
As an example, the Roboflow team used a mahjong tile recognition dataset from the RF100-VL benchmark that contains over 2,000 images. This guide demonstrates how to download the dataset, install the necessary tools, and fine-tune the model on your custom data.
Refer to this blog to learn more.
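If you want to follow along, datasets on Roboflow Universe can be pulled down in COCO format with the roboflow package. A sketch with placeholder API key, workspace, project, and version:

from roboflow import Roboflow

rf = Roboflow(api_key="<YOUR_API_KEY>")
project = rf.workspace("<WORKSPACE>").project("<PROJECT>")

# Download version 1 of the dataset in COCO format
dataset = project.version(1).download("coco")

# dataset.location is the local folder to pass as dataset_dir when training
print(dataset.location)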
The resulting display should show the ground truth on one side and the model’s detections on the other. In our example, RF-DETR correctly identifies most mahjong tiles, with only minor misdetections that can be improved with further training.
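To reproduce that side-by-side view yourself, supervision can plot two annotated images in a grid. A sketch, assuming annotated_gt and annotated_pred were built with the annotators shown earlier:

import supervision as sv

# Display ground truth and model predictions next to each other
sv.plot_images_grid(
    images=[annotated_gt, annotated_pred],
    grid_size=(1, 2),
    titles=["ground truth", "prediction"],
)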
Important Note:
RF-DETR is one of the best real-time DETR-based models, offering a strong balance between accuracy, speed, and domain adaptability. If you need a real-time, transformer-based detector that avoids post-processing overhead and generalizes beyond COCO, this is a top contender. However, YOLOv8 still holds an edge in raw speed for some applications.
A round of applause to the Roboflow ML team – Peter Robicheaux, James Gallagher, Joseph Nelson, Isaac Robinson.
Peter Robicheaux, James Gallagher, Joseph Nelson, Isaac Robinson. (Mar 20, 2025). RF-DETR: A SOTA Real-Time Object Detection Model. Roboflow Blog: https://blog.roboflow.com/rf-detr/
Roboflow’s RF-DETR represents a new generation of real-time object detection, balancing high accuracy, domain adaptability, and low latency in a single model. Whether you’re building a cutting-edge robotics system or deploying on resource-limited edge devices, RF-DETR offers a versatile and future-proof solution.
What are your thoughts? Let me know in the comment section.