Artificial intelligence (AI) is rapidly changing industries around the world, including healthcare, autonomous vehicles, banking, and customer service. While building AI models attracts a lot of attention, AI inference—the process of applying a trained model to fresh data to make predictions—is where the real-world impact occurs. As enterprises become more reliant on AI-powered applications, the demand for efficient, scalable, and low-latency inferencing solutions has never been higher.
This is where NVIDIA NIM comes into the picture. NVIDIA NIM is designed to help developers deploy AI models as microservices, simplifying the process of delivering inference solutions at scale. In this blog, we’ll dive deep into the capabilities of NIM, test a few models through the NIM API, and see how it’s revolutionizing AI inferencing.
This article was published as a part of the Data Science Blogathon.
NVIDIA NIM is a platform that uses microservices to make AI inference easier to run in real-world applications. Microservices are small services that work independently but can be combined into larger systems that scale. By packaging ready-to-use AI models as microservices, NIM lets developers deploy these models quickly and easily, without having to worry about the underlying infrastructure or how to scale it.
Let us understand the key features of NVIDIA NIM in detail below:
NVIDIA NIM provides a wide range of pretrained models that are ready for immediate deployment, covering various AI tasks such as language understanding and image generation—both of which we try out later in this article.
NIM is optimized for quick responses, so it works well for applications that need real-time processing. For example, a self-driving car makes decisions using live data from sensors and cameras; NIM helps ensure that the underlying AI models respond fast enough to meet those real-time demands.
Below, we will see how to access models from NVIDIA NIM:
In this section, we will explore how to evaluate the inferencing speed of various AI models. Understanding the response time of these models is crucial for applications that require real-time processing. We will begin with the Reasoning Model, specifically focusing on the Llama-3.2-3b-instruct Preview.
The Llama-3.2-3b-instruct model performs natural language processing tasks, effectively comprehending and responding to user queries. Below, we provide the necessary requirements and a step-by-step guide for setting up the environment to run this model.
Before we begin, ensure that you have the following libraries installed:
- openai: This library allows interaction with OpenAI-compatible APIs, which NVIDIA NIM exposes.
- python-dotenv: This library helps manage environment variables.
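You can install both packages with pip:

pip install openai python-dotenv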
To ensure a clean setup, we will create a virtual environment. This helps in managing dependencies effectively without affecting the global Python environment. Follow the commands below to set it up:
python -m venv env
.\env\Scripts\activate
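These commands work on Windows; on Linux or macOS, activate the environment with source env/bin/activate instead.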
Now, we will implement the code to interact with the Llama-3.2-3b-instruct model. The following script initializes the model, accepts user input, and calculates the inferencing speed:
from openai import OpenAI
from dotenv import load_dotenv
import os
import time

# Load the NVIDIA API key from a .env file
load_dotenv()
llama_api_key = os.getenv('NVIDIA_API_KEY')

# NIM exposes an OpenAI-compatible endpoint, so the OpenAI client works as-is
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=llama_api_key
)

user_input = input("What do you want to ask: ")

start_time = time.time()
completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": user_input}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True
)
# With stream=True, create() returns as soon as the model starts responding,
# so this measures the latency until streaming begins, not the total generation time
end_time = time.time()

# Print the streamed tokens as they arrive
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

response_time = end_time - start_time
print(f"\nResponse time: {response_time} seconds")
The output includes the response time, allowing you to evaluate how quickly the model starts responding:

Response time: 0.8189256191253662 seconds
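If you also want to time the complete generation and the latency to the first token, a minimal variation of the loop above (reusing client, user_input, and time from the script, and counting streamed chunks as a rough proxy for tokens) looks like this:

# Assumes `client`, `user_input`, and `time` from the script above
start_time = time.time()
completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": user_input}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    stream=True
)

first_token_time = None
chunk_count = 0
for chunk in completion:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        if first_token_time is None:
            first_token_time = time.time()  # moment the first token arrives
        chunk_count += 1
        print(delta, end="")
end_time = time.time()

if first_token_time is not None:
    print(f"\nTime to first token: {first_token_time - start_time:.3f} seconds")
print(f"Total generation time: {end_time - start_time:.3f} seconds")
print(f"Chunks streamed: {chunk_count}")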
Stable Diffusion 3 Medium is a cutting-edge generative AI model designed to transform text prompts into stunning visual imagery, empowering creators and developers to explore new realms of artistic expression and innovative applications. Below, we have implemented code that demonstrates how to use this model to generate images.
import requests
import base64
from dotenv import load_dotenv
import os
import time

# Load the Stable Diffusion API key from a .env file
load_dotenv()

invoke_url = "https://ai.api.nvidia.com/v1/genai/stabilityai/stable-diffusion-3-medium"
api_key = os.getenv('STABLE_DIFFUSION_API')

headers = {
    "Authorization": f"Bearer {api_key}",
    "Accept": "application/json",
}

payload = {
    "prompt": input("Enter Your Image Prompt Here: "),
    "cfg_scale": 5,            # how strongly the image should follow the prompt
    "aspect_ratio": "16:9",
    "seed": 0,                 # fixed seed for reproducible results
    "steps": 50,               # number of diffusion steps
    "negative_prompt": ""
}

# Time the full request/response round trip
start_time = time.time()
response = requests.post(invoke_url, headers=headers, json=payload)
end_time = time.time()

response.raise_for_status()
response_body = response.json()

# The generated image is returned as a base64-encoded string
image_data = response_body.get('image')
if image_data:
    image_bytes = base64.b64decode(image_data)
    with open('generated_image.png', 'wb') as image_file:
        image_file.write(image_bytes)
    print("Image saved as 'generated_image.png'")
else:
    print("No image data found in the response")

response_time = end_time - start_time
print(f"Response time: {response_time} seconds")
Output:
Response time: 3.790468692779541 seconds
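To quickly verify the saved file, you can open it with Pillow (an optional dependency, installable with pip install Pillow); this is just a convenience check, not part of the NIM workflow:

from PIL import Image

# Open the generated file and print its basic properties
img = Image.open('generated_image.png')
print(img.size, img.mode)
img.show()  # opens the image in the default viewer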
As AI applications grow in number and speed, solutions are needed that can execute many inference tasks efficiently. NVIDIA NIM plays a crucial role here: it helps businesses and developers use AI easily and at scale by combining pretrained AI models with fast GPU processing and a microservices setup. Teams can quickly deploy real-time applications in both cloud and edge settings, making their solutions highly flexible and durable in the field.
Q. What are the primary components of NVIDIA NIM?
A. The primary components include the inference server, pretrained models, TensorRT optimizations, and a microservices architecture for handling AI inference tasks efficiently.
Q. How does NVIDIA NIM work with existing AI models?
A. NVIDIA NIM is designed to work easily with current AI models. It lets developers add pretrained models from different sources into their applications by offering containerized microservices with standard APIs, making it easy to plug these models into existing systems without extensive changes. It essentially acts as a bridge between AI models and applications.
Q. How does NVIDIA NIM simplify building AI applications?
A. NVIDIA NIM removes hurdles in building AI applications by providing industry-standard APIs for developers, enabling them to build robust copilots, chatbots, and AI assistants. It also makes it easier for IT and DevOps teams to host AI models within their own controlled environments.
Q. How many API credits do you get?
A. If you sign up with a personal email you get 1,000 API credits; with a business email you get 5,000 API credits.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.