Nvidia has launched its latest Small Language Model (SLM), Nemotron-Mini-4B-Instruct. An SLM is a distilled, quantized, fine-tuned version of a larger base model, developed primarily for speed and on-device deployment. Nemotron-Mini-4B is a fine-tuned version of Nvidia's Minitron-4B-Base, which is itself a pruned and distilled version of Nemotron-4 15B. This instruct model is optimized for roleplay, RAG QA, and function calling in English. Trained between February 2024 and August 2024, it incorporates recent events and developments worldwide.
This article explores Nvidia’s Nemotron-Mini-4B-Instruct, a Small Language Model (SLM). We will discuss its evolution from the larger Nemotron-4 15B model, focusing on its distilled and fine-tuned nature for speed and on-device deployment. Additionally, we highlight its training period from February to August 2024, showcasing how it incorporates the latest global developments, making it a powerful tool in real-time AI applications.
Small language models (SLMs) serve as compact versions of large language models, designed to perform NLP tasks while using reduced computational resources. They optimize for efficiency and speed, often delivering good performance on specific tasks with far fewer parameters. These features make them ideal for edge devices or on-device computing with limited memory and processing power. Such models are less powerful than LLMs but can do a better job on domain-focused tasks.
Typically, developers train or fine-tune small language models (SLMs) from large language models (LLMs) using various techniques that reduce the model’s size while maintaining a reasonable level of performance.
Pruning, knowledge distillation, quantization, and task-specific fine-tuning are some of the cutting-edge techniques used to build SLMs.
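As a minimal sketch of one of these techniques, quantization, the snippet below loads the model in 4-bit precision via the transformers BitsAndBytesConfig API. It assumes the bitsandbytes package and a CUDA GPU are available; the exact settings are illustrative, not an official recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization config (assumes bitsandbytes is installed and a CUDA GPU is present)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Loading in 4-bit shrinks the memory footprint roughly 4x versus float16,
# trading a little accuracy for speed and on-device friendliness
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-Mini-4B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)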
Small Language Models (SLMs) play a crucial role in the current AI landscape due to their efficiency, scalability, and accessibility. Here are some important reasons:
At Gamescom 2024, NVIDIA announced the first on-device SLM for improving the conversational abilities of game characters. The game Mecha BREAK by Amazing Seasun Games utilizes the NVIDIA ACE suite, a set of digital human technologies that provide speech, intelligence, and animation powered by generative AI.
Creating a robust development environment is essential for the successful development of your chatbot. This step involves configuring the necessary tools, libraries, and frameworks that will enable you to write, test, and refine your code efficiently.
First, create an Anaconda environment. Run the command below in your terminal.
# Create conda env
$ conda create -n nemotron python=3.11
It will create a Python 3.11 environment named nemotron.
Setting up a development environment is a crucial step in building your chatbot, as it provides the necessary tools and frameworks for coding and testing. We’ll walk you through the process of activating your development environment, ensuring you have everything you need to bring your chatbot to life seamlessly.
# Create a dev folder and activate the anaconda env
$ mkdir nemotron-dev
$ cd nemotron-dev
# Activating nemotron conda env
$ conda activate nemotron
First, install PyTorch according to your OS to set up your developer environment. Then install transformers and LangChain using pip.
# Install PyTorch (Windows) for GPU
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install PyTorch (Windows) CPU
pip install torch torchvision torchaudio
Second, install transformers and LangChain.
# Install transformers and LangChain
pip install transformers langchain
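Optionally, you can confirm everything is in place by printing the installed versions with a quick one-liner:
# Verify the installation (optional)
python -c "import torch, transformers, langchain; print(torch.__version__, transformers.__version__)"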
Have you ever wondered how to create a chatbot that can hold a conversation? In this section, we will guide you through the code implementation of a simple chatbot. You’ll learn about the key components, programming languages, and libraries involved in building a functional conversational agent, enabling you to design an engaging and interactive user experience.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot; reply in the style of a professor",
    },
    {"role": "user", "content": "What is Quantum Entanglement?"},
]

# Apply the chat template and generate a reply
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
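One note on the final print: tokenizer.decode(outputs[0]) returns the prompt tokens along with the reply. An optional tweak, sketched below with the same variables, slices off the prompt so only the newly generated text is printed:
# Optional: decode only the newly generated tokens, skipping the echoed prompt
reply = tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True)
print(reply)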
Here, we download Nemotron-Mini-4B-Instruct (Nemo) from the Hugging Face Hub through the transformers AutoModelForCausalLM class, along with its tokenizer via AutoTokenizer.
We create a message template for a professor chatbot and ask the question "What is Quantum Entanglement?"
Let's see how Nemo answers that question.
Wow, it answered pretty well. We will now create a more user-friendly chatbot that we can chat with continuously.
We will explore the process of building an advanced user-friendly chatbot that not only meets the needs of users but also enhances their interaction experience. We’ll discuss the essential components, design principles, and technologies involved in creating a chatbot that is intuitive, responsive, and capable of understanding user intent, ultimately bridging the gap between technology and user satisfaction.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import time

class PirateBot:
    def __init__(self, model_name="nvidia/Nemotron-Mini-4B-Instruct"):
        print("Ahoy! Yer pirate bot be loadin' the model. Stand by, ye scurvy dog!")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

        # Move model to GPU if available
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        print(f"Arrr! The model be ready on {self.device}!")

        # System prompt that fixes the bot's pirate persona
        self.messages = [
            {
                "role": "system",
                "content": "You are a friendly chatbot who always responds in the style of a pirate",
            }
        ]

    def generate_response(self, user_input, max_new_tokens=1024):
        # self.messages keeps the full conversation history for multi-turn chat
        self.messages.append({"role": "user", "content": user_input})
        tokenized_chat = self.tokenizer.apply_chat_template(
            self.messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(self.device)

        # Streamer yields text chunks as the model produces them
        streamer = TextIteratorStreamer(self.tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)

        generation_kwargs = dict(
            inputs=tokenized_chat,
            max_new_tokens=max_new_tokens,
            streamer=streamer,
            do_sample=True,
            top_p=0.95,
            top_k=50,
            temperature=0.7,
            num_beams=1,
        )

        # Run generation in a background thread so we can consume the stream here
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        print("Pirate's response: ", end="", flush=True)
        generated_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            generated_text += new_text
            time.sleep(0.05)  # Add a small delay for a more natural feel
        print("\n")

        # Store the assistant's reply so the next turn sees the full history
        self.messages.append({"role": "assistant", "content": generated_text.strip()})
        return generated_text.strip()

    def chat(self):
        print("Ahoy, matey! I be yer pirate chatbot. What treasure of knowledge ye be seekin'?")
        while True:
            user_input = input("You: ")
            if user_input.lower() in ['exit', 'quit', 'goodbye']:
                print("Farewell, ye landlubber! May fair winds find ye!")
                break
            try:
                self.generate_response(user_input)
            except Exception as e:
                print(f"Blimey! We've hit rough seas: {str(e)}")

if __name__ == "__main__":
    bot = PirateBot()
    bot.chat()
The above code consists of three methods:
The __init__ method is mostly self-explanatory: it sets up the tokenizer, model, device, and the system prompt for our pirate bot.
The generate_response method takes two inputs, user_input and max_new_tokens. The user input is appended to the self.messages list with the role "user", so self.messages tracks the conversation history between the user and the assistant. The TextIteratorStreamer creates a streamer object that handles live streaming of the model's response, allowing us to print the output as it is generated and creating a more natural conversational feel.
Generation runs on a separate thread, which calls the model's generate function to produce the assistant's response. The streamer starts outputting the text in real time as the model generates it.
The response is printed piece by piece as it's generated, simulating a typing effect. A small delay (time.sleep(0.05)) adds a pause between chunks for a more natural feel.
We will now delve into the testing phase of our chatbot, focusing on its knowledge capabilities and responsiveness. By engaging with the bot through various queries, we aim to evaluate its ability to provide accurate and relevant information, highlighting the effectiveness of the underlying Small Language Model (SLM) in delivering meaningful interactions.
Starting the interface of the chatbot:
We will ask Nemo different types of questions to explore its knowledge capabilities.
The traveling salesman problem asks for the shortest route that visits a set of points, such as a driver making several deliveries from a restaurant. Map and delivery services solve closely related routing problems when they plan navigation routes, and network services rely on similar shortest-path algorithms to deliver responses to queries.
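To make this concrete, here is a toy brute-force sketch of the problem with a hypothetical distance matrix between four delivery stops; real services use far more scalable heuristics:
from itertools import permutations

# Hypothetical symmetric distances between 4 delivery stops
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]

def route_length(route):
    # Sum consecutive legs, returning to the starting stop at the end
    return sum(dist[route[i]][route[(i + 1) % len(route)]] for i in range(len(route)))

# Brute force: check every ordering of the stops (feasible only for tiny inputs)
best = min(permutations(range(4)), key=route_length)
print(best, route_length(best))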
We can see that the model performs reasonably well on all the questions, which we drew from several different subject areas.
Nemotron Mini 4B is a very capable model for business applications, and a game company is already using it with the Nvidia ACE suite. It is just the start of cutting-edge applications of generative AI models in the gaming industry, running directly on the player's computer to enhance the gaming experience. This is the tip of the iceberg; in the coming days, we will explore more ideas around SLMs.
Explore the code behind this article on GitHub!
A. SLMs are more resource-efficient than LLMs. They are built specifically for on-device, IoT, and edge deployments.
A. Yes, you can fine-tune SLMs for specific tasks such as text classification, chatbots, generating bills for healthcare services, customer care, and in-game dialogue and characters.
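As a rough illustration, parameter-efficient fine-tuning with LoRA via the peft library might look like the sketch below; the library choice and the target module names are assumptions and may need adjusting for this architecture:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model  # assumes the peft package is installed

model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# LoRA trains small adapter matrices instead of the full model,
# so only a tiny fraction of the parameters is updated for the new task
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed module names; verify against the model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()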
A. Yes, you can start using Nemotron-Mini-4B-Instruct directly through Ollama. Just install Ollama and run the model; that's all it takes to start asking questions directly on the command line.
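The command would look roughly like this; note that the exact model tag in the Ollama library may differ:
# Pull and run the model with Ollama (model tag may vary)
ollama run nemotron-mini-4b-instruct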
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.