Deep learning has revolutionized the way we solve complex problems, from image recognition to natural language processing. However, while training these models often relies on high-performance GPUs, deploying them effectively in resource-constrained environments, such as edge devices or systems with limited hardware, presents unique challenges. CPUs, being widely available and cost-efficient, often serve as the backbone for inference in such scenarios. But how do we ensure that models deployed on CPUs deliver optimal performance without compromising accuracy?
This article dives into the benchmarking of deep learning model inference on CPUs, focusing on three critical metrics: latency, CPU utilization, and memory utilization. Using a spam classification example, we explore how popular frameworks like PyTorch, TensorFlow, JAX, ONNX Runtime, and OpenVINO handle inference workloads. By the end, you’ll have a clear understanding of how to measure performance, optimize deployments, and select the right tools and frameworks for CPU-based inference in resource-constrained environments.
Impact: Optimal inference execution can save a significant amount of money and free up resources for other workloads.
Use psutil and time to collect accurate performance data and optimize inference.
This article was published as a part of the Data Science Blogathon.
Inference speed is essential for user experience and operational efficiency in machine learning applications. Runtime optimization plays a key role in enhancing this by streamlining execution. Using hardware-accelerated libraries like ONNX Runtime takes advantage of optimizations tailored to specific architectures, reducing latency (time per inference).
Additionally, lightweight model formats such as ONNX minimize overhead, enabling faster loading and execution. Optimized runtimes leverage parallel processing to distribute computation across available CPU cores and improve memory management, ensuring better performance especially on systems with limited resources. This approach makes models faster and more efficient while maintaining accuracy.
To evaluate the performance of our models, we focus on three key metrics: latency, CPU utilization, and memory utilization.
To keep this benchmarking study focused and practical, we made the following assumptions and set a few boundaries:
These assumptions ensure the benchmarks remain relevant for developers and teams working with resource-constrained hardware or who need predictable performance without the added complexity of distributed systems.
We’ll explore the essential tools and frameworks used to benchmark and optimize deep learning model inference on CPUs, providing insights into their capabilities for efficient execution in resource-constrained environments.
We use a GitHub Codespace (virtual machine) with the following configuration:
The package versions used are listed below. They primarily include five deep learning inference libraries: TensorFlow, PyTorch, ONNX Runtime, JAX, and OpenVINO:
!pip install numpy==1.26.4
!pip install torch==2.2.2
!pip install tensorflow==2.16.2
!pip install onnx==1.17.0
!pip install onnxruntime==1.17.0
!pip install jax==0.4.30
!pip install jaxlib==0.4.30
!pip install openvino==2024.6.0
!pip install matplotlib==3.9.3
!pip install Pillow==8.3.2
!pip install psutil==5.8.0
Model inference consists of a handful of matrix operations between network weights and input data, so benchmarking it does not require model training or a real dataset. For our benchmarking process, we simulated a standard classification use case that resembles common binary classification tasks such as spam detection and loan application decisions (approval or denial). The binary nature of these problems makes them ideal for comparing model performance across different frameworks. This setup reflects real-world systems while letting us focus on inference performance without needing large datasets or pre-trained models.
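To make the "handful of matrix operations" concrete, here is a minimal NumPy sketch of a forward pass through a small feedforward network. The layer sizes, the random weights, and the helper names dense and forward are illustrative assumptions, not the benchmark code used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def dense(x, w, b):
    """One fully connected layer: a matrix product plus a bias."""
    return x @ w + b

def forward(x, params):
    """Apply ReLU after every hidden layer and a sigmoid at the output."""
    for w, b in params[:-1]:
        x = np.maximum(dense(x, w, b), 0.0)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-dense(x, w, b)))

# Hypothetical layer sizes, chosen only for illustration.
sizes = [200, 64, 1]
params = [
    (rng.standard_normal((m, n)).astype(np.float32) * 0.1,
     np.zeros(n, dtype=np.float32))
    for m, n in zip(sizes[:-1], sizes[1:])
]

batch = rng.random((4, 200), dtype=np.float32)  # 4 samples, 200 features
probs = forward(batch, params)                  # one probability per sample
```

Every framework benchmarked here performs essentially this computation; what differs is how each runtime schedules and optimizes it.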
The sample task involves predicting whether a given sample is spam or not (or, in the loan analogy, approved or denied), based on a set of input features. This binary classification problem is computationally cheap, allowing for a focused analysis of inference performance without the complexity of multi-class classification tasks.
To simulate real-world email data, we generated random input vectors. These vectors mimic the kind of embeddings a spam filter might process while avoiding the need for external datasets, which makes them well suited to testing inference time, memory usage, and CPU performance. Alternatively, you can run the same benchmarking process on image classification, NLP, or any other deep learning task.
Model selection is a critical step in benchmarking because it directly influences inference performance and the insights gained from profiling. As mentioned in the previous section, we chose a standard classification use case for this study: identifying whether a given email is spam or not. This is a straightforward two-class classification problem that is computationally efficient yet provides meaningful results for comparison across frameworks.
The model for the Classification task is a Feedforward Neural Network (FNN) designed for binary classification (Spam vs. Not Spam). It consists of the following layers:
self.fc1 = torch.nn.Linear(200, 128)
self.fc2 = torch.nn.Linear(128, 64)
self.fc3 = torch.nn.Linear(64, 32)
self.fc4 = torch.nn.Linear(32, 16)
self.fc5 = torch.nn.Linear(16, 8)
self.fc6 = torch.nn.Linear(8, 1)
self.sigmoid = torch.nn.Sigmoid()
The model is simple yet effective for the classification task.
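A quick way to see how small this network is: count its parameters. The sketch below does the arithmetic in plain Python, assuming each layer is a standard fully connected layer with a bias term (the PyTorch default for Linear). The helper name count_parameters is ours.

```python
# Layer sizes taken from the Linear definitions above: (inputs, outputs).
layer_shapes = [(200, 128), (128, 64), (64, 32), (32, 16), (16, 8), (8, 1)]

def count_parameters(shapes):
    """Each Linear layer holds inputs * outputs weights plus one bias per output."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

total_params = count_parameters(layer_shapes)
size_kb = total_params * 4 / 1024  # float32 uses 4 bytes per parameter

print(total_params)       # 36737
print(round(size_kb, 1))  # 143.5
```

At roughly 36.7K parameters (about 144 KB in float32), the weights are tiny, so the memory figures reported later likely reflect the loaded frameworks far more than the model itself.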
The model architecture diagram used for benchmarking in our use case is shown below:
This workflow aims to compare the inference performance of multiple deep learning frameworks (TensorFlow, PyTorch, ONNX, JAX, and OpenVINO) using the classification task. The task involves using randomly generated input data and benchmarking each framework to measure the average time taken for a prediction.
To get started with benchmarking deep learning models, we first need to import the essential Python packages that enable seamless integration and performance evaluation.
import os

# Set these before importing the frameworks so they take effect at import time
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # Disable GPU
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logs

import csv
import time

import numpy as np
import torch
import tensorflow as tf
from tensorflow.keras import Input
import onnxruntime as ort
import matplotlib.pyplot as plt
from PIL import Image
import psutil
import jax
import jax.numpy as jnp
from openvino.runtime import Core
In this step, we randomly generate input data for spam classification:
We generate random data using NumPy to serve as input features for the models.
# Generate dummy data
input_data = np.random.rand(1000, 200).astype(np.float32)
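If you need identical inputs on every run, for example to compare frameworks on exactly the same data, one option is to seed a NumPy generator. This is a small sketch. The seed value 42 and the name rng are arbitrary choices, not part of the original benchmark.

```python
import numpy as np

# Seeded generator: the seed (42) is an arbitrary, illustrative choice.
rng = np.random.default_rng(42)

# Same shape and dtype as the benchmark input: 1,000 samples x 200 features.
input_data = rng.random((1000, 200), dtype=np.float32)
```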
In this step, we define the network architecture or set up the model for each deep learning framework (TensorFlow, PyTorch, ONNX, JAX, and OpenVINO). Each framework requires its own method for loading models and setting them up for inference.
class PyTorchModel(torch.nn.Module):
    def __init__(self):
        super(PyTorchModel, self).__init__()
        self.fc1 = torch.nn.Linear(200, 128)
        self.fc2 = torch.nn.Linear(128, 64)
        self.fc3 = torch.nn.Linear(64, 32)
        self.fc4 = torch.nn.Linear(32, 16)
        self.fc5 = torch.nn.Linear(16, 8)
        self.fc6 = torch.nn.Linear(8, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = torch.relu(self.fc5(x))
        x = self.sigmoid(self.fc6(x))
        return x

# Create PyTorch model
pytorch_model = PyTorchModel()
tensorflow_model = tf.keras.Sequential([
    Input(shape=(200,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
tensorflow_model.compile()
def jax_model(x):
    x = jax.nn.relu(jnp.dot(x, jnp.ones((200, 128))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((128, 64))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((64, 32))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((32, 16))))
    x = jax.nn.relu(jnp.dot(x, jnp.ones((16, 8))))
    x = jax.nn.sigmoid(jnp.dot(x, jnp.ones((8, 1))))
    return x
# Convert PyTorch model to ONNX
dummy_input = torch.randn(1, 200)
onnx_model_path = "model.onnx"
torch.onnx.export(
    pytorch_model,
    dummy_input,
    onnx_model_path,
    export_params=True,
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
onnx_session = ort.InferenceSession(onnx_model_path)
# OpenVINO Model Definition
core = Core()
openvino_model = core.read_model(model="model.onnx")
compiled_model = core.compile_model(openvino_model, device_name="CPU")
This function runs the benchmark for a given framework and takes three arguments: predict_function, input_data, and num_runs. By default it runs 1,000 iterations, but this can be increased as required. It returns the average latency in seconds per call, the average CPU usage in percent, and the average resident memory in MB.
def benchmark_model(predict_function, input_data, num_runs=1000):
    start_time = time.time()
    process = psutil.Process(os.getpid())
    cpu_usage = []
    memory_usage = []
    for _ in range(num_runs):
        predict_function(input_data)
        cpu_usage.append(process.cpu_percent())
        memory_usage.append(process.memory_info().rss)
    end_time = time.time()
    avg_latency = (end_time - start_time) / num_runs
    avg_cpu = np.mean(cpu_usage)
    avg_memory = np.mean(memory_usage) / (1024 * 1024)  # Convert to MB
    return avg_latency, avg_cpu, avg_memory
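The function above reports a single average over all runs. A possible refinement, sketched below, times each call separately with time.perf_counter, skips a few warm-up calls so one-time costs such as lazy initialization do not skew the numbers, and reports a median and 95th percentile alongside the mean. The names benchmark_latency and warmup are our own, and this variant is not what produced the results in this article.

```python
import statistics
import time

def benchmark_latency(predict_function, input_data, num_runs=1000, warmup=10):
    """Time each call separately, after a few untimed warm-up calls."""
    for _ in range(warmup):
        predict_function(input_data)  # excluded from the statistics

    latencies_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        predict_function(input_data)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
    }

# Usage with a stand-in workload instead of a real model.
stats = benchmark_latency(lambda data: sum(data), list(range(1000)), num_runs=200)
```

Tail latency (p95) is often more informative than the mean for real-time services, since occasional slow calls are what users notice.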
Now that we have loaded the models, it’s time to benchmark the performance of each framework. The benchmarking process performs inference on the generated input data.
# Benchmark PyTorch model
def pytorch_predict(input_data):
    pytorch_model(torch.tensor(input_data))

pytorch_latency, pytorch_cpu, pytorch_memory = benchmark_model(pytorch_predict, input_data)

# Benchmark TensorFlow model
def tensorflow_predict(input_data):
    tensorflow_model(input_data)

tensorflow_latency, tensorflow_cpu, tensorflow_memory = benchmark_model(tensorflow_predict, input_data)

# Benchmark JAX model
def jax_predict(input_data):
    jax_model(jnp.array(input_data))

jax_latency, jax_cpu, jax_memory = benchmark_model(jax_predict, input_data)

# Benchmark ONNX model
def onnx_predict(input_data):
    # Process inputs one sample at a time
    for i in range(input_data.shape[0]):
        single_input = input_data[i:i+1]  # Extract single input
        onnx_session.run(None, {onnx_session.get_inputs()[0].name: single_input})

onnx_latency, onnx_cpu, onnx_memory = benchmark_model(onnx_predict, input_data)

# Benchmark OpenVINO model
def openvino_predict(input_data):
    # Process inputs one sample at a time
    for i in range(input_data.shape[0]):
        single_input = input_data[i:i+1]  # Extract single input
        compiled_model.infer_new_request({0: single_input})

openvino_latency, openvino_cpu, openvino_memory = benchmark_model(openvino_predict, input_data)
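Note that the ONNX and OpenVINO functions above loop over the 1,000 rows and run one inference per row, whereas the PyTorch, TensorFlow, and JAX functions pass all 1,000 rows in a single call. If you want every framework measured in the same mode, one option is a small wrapper like the hypothetical per_sample helper sketched here. It is our own illustration and was not used for the results below.

```python
def per_sample(predict_batch):
    """Wrap a batch predictor so it is called once per row of the input."""
    def run(input_data):
        for i in range(len(input_data)):
            predict_batch(input_data[i:i + 1])  # slice keeps the batch dimension
    return run

# Stand-in predictor that only records the batch sizes it receives.
seen_batch_sizes = []

def fake_predict(batch):
    seen_batch_sizes.append(len(batch))

rows = [[0.0] * 200 for _ in range(5)]  # 5 samples, 200 features each
per_sample(fake_predict)(rows)
print(seen_batch_sizes)  # [1, 1, 1, 1, 1]
```

For example, benchmark_model(per_sample(pytorch_predict), input_data) would time PyTorch one row at a time, matching the ONNX and OpenVINO loops. Conversely, because the ONNX model was exported with a dynamic batch axis, passing the whole array in a single onnx_session.run call should also work and would match the batched frameworks instead.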
Here we discuss the results of benchmarking the deep learning frameworks mentioned above. We compare them on latency, CPU usage, and memory usage, and include tabular data and a plot for quick comparison.
Framework | Latency (ms) | Relative Latency (vs. PyTorch) |
PyTorch | 1.26 | 1.0 (baseline) |
TensorFlow | 6.61 | ~5.25× |
JAX | 3.15 | ~2.50× |
ONNX | 14.75 | ~11.72× |
OpenVINO | 144.84 | ~115× |
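The relative column is simply each latency divided by the PyTorch baseline. The short sketch below reproduces that arithmetic from the rounded values in the table. The helper name relative_to is ours. Because it starts from rounded figures, the ONNX ratio comes out as 11.71 rather than the ~11.72 shown, which was presumably computed from unrounded measurements.

```python
# Latencies in milliseconds, copied from the table above.
latency_ms = {
    "PyTorch": 1.26,
    "TensorFlow": 6.61,
    "JAX": 3.15,
    "ONNX": 14.75,
    "OpenVINO": 144.84,
}

def relative_to(values, baseline):
    """Divide every value by the baseline framework's value."""
    base = values[baseline]
    return {name: round(value / base, 2) for name, value in values.items()}

relative_latency = relative_to(latency_ms, "PyTorch")
print(relative_latency)
```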
Insights:
Framework | CPU Usage (%) | Relative CPU Usage (vs. OpenVINO) |
PyTorch | 99.79 | ~1.00 |
TensorFlow | 112.26 | ~1.13 |
JAX | 130.03 | ~1.31 |
ONNX | 99.58 | ~1.00 |
OpenVINO | 99.32 | 1.00 (baseline) |
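Values above 100% are expected here: psutil reports per-process CPU usage relative to a single core, so a process running threads on several cores can exceed 100%. To express usage as a share of the whole machine, you can divide by the core count, as in this sketch. The helper name normalize_cpu_percent and the 4-core figure in the example are assumptions for illustration only.

```python
import os

def normalize_cpu_percent(process_cpu_percent, num_cores=None):
    """Rescale a per-process CPU percentage to a 0-100 share of the whole machine."""
    if num_cores is None:
        num_cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return process_cpu_percent / num_cores

# Example: the JAX figure from the table on a hypothetical 4-core machine.
share = normalize_cpu_percent(130.03, num_cores=4)
print(round(share, 1))  # 32.5
```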
Insights:
Framework | Memory (MB) | Relative Memory Usage (vs. PyTorch) |
PyTorch | ~959.69 | 1.0 (baseline) |
TensorFlow | ~969.72 | ~1.01× |
JAX | ~1033.63 | ~1.08× |
ONNX | ~1033.82 | ~1.08× |
OpenVINO | ~1040.80 | ~1.08–1.09× |
Insights:
Here is the plot comparing the Performance of Deep Learning Frameworks:
In this article, we presented a comprehensive benchmarking workflow to evaluate the inference performance of prominent deep learning frameworks—TensorFlow, PyTorch, ONNX, JAX, and OpenVINO—using a spam classification task as a reference. By analyzing key metrics such as latency, CPU usage and memory consumption, the results highlighted the trade-offs between frameworks and their suitability for different deployment scenarios.
PyTorch demonstrated the most balanced performance, with the lowest latency and the lowest memory usage, making it well suited to latency-sensitive applications like real-time predictions and recommendation systems. TensorFlow provided a middle-ground solution with moderately higher resource consumption. JAX had the second-lowest latency but the highest CPU utilization, which might be a limiting factor in resource-constrained environments. Meanwhile, ONNX and OpenVINO lagged in latency, with OpenVINO’s performance particularly hindered by the absence of hardware acceleration. Keep in mind that the ONNX and OpenVINO prediction functions ran the 1,000 samples one at a time, while the other three frameworks processed them in a single batched call, so part of this gap reflects the calling pattern rather than the runtimes themselves.
These findings underline the importance of aligning framework selection with deployment needs. Whether optimizing for speed, resource efficiency, or specific hardware, understanding the trade-offs is essential for effective model deployment in real-world environments.
A. PyTorch’s dynamic computation graph and efficient execution pipeline allow for low-latency inference (1.26 ms), making it well-suited for applications like recommendation systems and real-time predictions.
A. OpenVINO’s optimizations are designed for Intel hardware. Without this acceleration, its latency (144.84 ms) and memory usage (1040.8 MB) were less competitive compared to other frameworks.
A. For CPU-only setups, PyTorch is the most efficient. TensorFlow is a strong alternative for moderate workloads. Avoid frameworks like JAX unless higher CPU utilization is acceptable.
A. Framework performance depends heavily on hardware compatibility. For instance, OpenVINO excels on Intel CPUs with hardware-specific optimizations, while PyTorch and TensorFlow perform consistently across varied setups.
A. Yes, these results reflect a simple binary classification task. Performance could vary with complex architectures like ResNet or tasks like NLP or others, where these frameworks might leverage specialized optimizations.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.