How to Work with Nvidia Nemotron-Mini-4B-Instruct?

Avijit Biswas Last Updated : 29 Nov, 2024
8 min read

Introduction

Nvidia has launched its latest Small Language Model (SLM), Nemotron-Mini-4B-Instruct. An SLM is a distilled, quantized, fine-tuned version of a larger base model, developed primarily for speed and on-device deployment. Nemotron-Mini-4B is a fine-tuned version of Nvidia's Minitron-4B-Base, which was itself a pruned and distilled version of Nemotron-4 15B. This instruct model is optimized for roleplay, RAG QA, and function calling in English. Trained between February 2024 and August 2024, it incorporates recent events and developments worldwide.

This article explores Nvidia’s Nemotron-Mini-4B-Instruct, a Small Language Model (SLM). We will discuss its evolution from the larger Nemotron-4 15B model, focusing on its distilled and fine-tuned nature for speed and on-device deployment. Additionally, we highlight its training period from February to August 2024, showcasing how it incorporates the latest global developments, making it a powerful tool in real-time AI applications.

Learning Outcomes

  • Understand the architecture and optimization techniques behind Small Language Models (SLMs) like Nvidia’s Nemotron-Mini-4B-Instruct.
  • Learn how to set up a development environment for implementing SLMs using Conda and install essential libraries.
  • Gain hands-on experience in coding a chatbot that utilizes the Nemotron-Mini-4B-Instruct model for interactive conversations.
  • Explore real-world applications of SLMs in gaming and other industries, highlighting their advantages over larger models.
  • Discover the difference between SLMs and LLMs, including their resource efficiency and adaptability for specific tasks.

This article was published as a part of the Data Science Blogathon.

What are Small Language Models (SLMs)?

Small language models (SLMs) serve as compact versions of large language models, designed to perform NLP tasks while using reduced computational resources. They optimize for efficiency and speed, often delivering good performance on specific tasks with far fewer parameters. These features make them ideal for edge devices or on-device computing with limited memory and processing power. SLMs are less powerful than LLMs overall, but they can do a better job on domain-focused tasks.

Training Techniques for Small Language Models

Typically, developers train or fine-tune small language models (SLMs) from large language models (LLMs) using various techniques that reduce the model’s size while maintaining a reasonable level of performance.

  • Knowledge Distillation: The LLM trains the smaller model, with the LLM acting as the teacher and the SLM as the student. The small model learns to mimic the teacher's output, capturing the essential knowledge while reducing complexity.
  • Parameter Pruning: The training process removes redundant or less important parameters from the LLM, reducing the model size without drastically affecting performance.
  • Quantization: Model weights are converted from higher precision formats, such as 32-bit, to lower precision formats like 8-bit or 4-bit, which reduces memory usage and speeds up computations (see the sketch after this list).
  • Task-Specific Fine-Tuning: A pre-trained LLM undergoes fine-tuning on a specific task using a smaller dataset, optimizing the smaller model for targeted tasks like roleplaying and QA chat.

These are some of the cutting-edge techniques used to build and tune SLMs.
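
To make quantization concrete, here is a minimal sketch of loading the model in 4-bit precision with transformers and bitsandbytes. This is an illustrative example rather than Nvidia's own quantization pipeline, and it assumes the bitsandbytes package is installed and a CUDA GPU is available.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config (illustrative; requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)

# Weights are quantized on the fly at load time, cutting memory use roughly 4x
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-Mini-4B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)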

Significance of SLMs in Today’s AI Landscape

Small Language Models (SLMs) play a crucial role in the current AI landscape due to their efficiency, scalability, and accessibility. Here are some important reasons:

  • Resource Efficiency: SLMs require significantly less computational power, memory, and storage, making them ideal for on-device and mobile applications.
  • Faster Inference: Their smaller size allows for quicker inference times, which is essential for real-time applications like chatbots, voice assistants, and IoT devices.
  • Cost-Effective: Training and deploying large language models can be expensive; SLMs offer a more cost-effective solution for businesses and developers, democratizing AI access.
  • Adaptability: Due to their size, users can fine-tune SLMs more easily for specific tasks or niche applications, enabling greater adaptability across a wide range of industries, including healthcare and retail.

Real-World Applications of Nemotron-Mini-4B

NVIDIA announced at Gamescom 2024 the first on-device SLM for improving the conversational abilities of game characters. The game Mecha BREAK by Amazing Seasun Games uses the NVIDIA ACE suite, a set of digital human technologies that provide speech, intelligence, and animation powered by generative AI.

Setting Up Your Development Environment

Creating a robust development environment is essential for the successful development of your chatbot. This step involves configuring the necessary tools, libraries, and frameworks that will enable you to write, test, and refine your code efficiently.

Step 1: Create a Conda Environment

First, create an Anaconda environment. Run the command below in your terminal.

# Create conda env
$ conda create -n nemotron python=3.11

It will create a Python 3.11 environment named nemotron.

Step 2: Activating the Development Environment

Next, create a project folder and activate the nemotron environment you just created, so everything you install stays isolated to this project.

# Create a dev folder and activate the anaconda env
$ mkdir nemotron-dev
$ cd nemotron-dev
# Activating nemotron conda env
$ conda activate nemotron

Step 3: Installing Essential Libraries

First, install PyTorch according to your OS and hardware to set up your development environment. Then, install transformers and LangChain using pip.

# Install PyTorch (Windows) with CUDA GPU support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install PyTorch (Windows) CPU-only
pip install torch torchvision torchaudio

Second, install transformers and LangChain.

# Install transformers and LangChain
pip install transformers langchain
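
As a quick optional sanity check (a convenience snippet, not part of the original setup), you can confirm the installs worked and whether PyTorch can see a GPU:

import torch
import transformers

# Print versions and whether a CUDA GPU is visible to PyTorch
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())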

Code Implementation for a Simple Chatbot

Have you ever wondered how to create a chatbot that can hold a conversation? In this section, we walk through the code for a simple chatbot, covering the key components and libraries involved in building a functional conversational agent.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot. Reply in the style of a professor.",
    },
    {"role": "user", "content": "What is Quantum Entanglement?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

Here, we download Nemotron-Mini-4B-Instruct (Nemo) from the Hugging Face Hub through the transformers AutoModelForCausalLM class, along with its tokenizer via AutoTokenizer.
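
Note that decoding outputs[0] directly prints the prompt tokens along with the reply. As a small optional tweak (not in the original snippet), you can slice off the prompt and skip special tokens to print only the model's answer:

# Decode only the newly generated tokens, skipping special tokens
response = tokenizer.decode(
    outputs[0][tokenized_chat.shape[1]:],
    skip_special_tokens=True,
)
print(response)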

Creating a Message Template

We create a message template for a professor-style chatbot and ask the question "What is Quantum Entanglement?"

Let's see how Nemo answers that question.

Output:

Wow, it answered pretty well. We will now create a more user-friendly chatbot that we can chat with continuously.

Building an Advanced User-Friendly Chatbot

We will now build a more advanced, user-friendly chatbot that enhances the interaction experience. We'll cover the essential components involved, conversation history, response streaming, and sampling parameters, so the bot feels intuitive and responsive.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import time

class PirateBot:
    def __init__(self, model_name="nvidia/Nemotron-Mini-4B-Instruct"):
        print("Ahoy! Yer pirate bot be loadin' the model. Stand by, ye scurvy dog!")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        
        # Move model to GPU if available
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        
        print(f"Arrr! The model be ready on {self.device}!")
        
        self.messages = [
            {
                "role": "system",
                "content": "You are a friendly chatbot who always responds in the style of a pirate",
            }
        ]

    def generate_response(self, user_input, max_new_tokens=1024):
        self.messages.append({"role": "user", "content": user_input})
        
        tokenized_chat = self.tokenizer.apply_chat_template(
            self.messages, 
            tokenize=True, 
            add_generation_prompt=True, 
            return_tensors="pt"
        ).to(self.device)

        streamer = TextIteratorStreamer(self.tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
        
        generation_kwargs = dict(
            inputs=tokenized_chat,
            max_new_tokens=max_new_tokens,
            streamer=streamer,
            do_sample=True,
            top_p=0.95,
            top_k=50,
            temperature=0.7,
            num_beams=1,
        )

        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        print("Pirate's response: ", end="", flush=True)
        generated_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            generated_text += new_text
            time.sleep(0.05)  # Add a small delay for a more natural feel
        print("\n")

        self.messages.append({"role": "assistant", "content": generated_text.strip()})
        return generated_text.strip()

    def chat(self):
        print("Ahoy, matey! I be yer pirate chatbot. What treasure of knowledge ye be seekin'?")
        while True:
            user_input = input("You: ")
            if user_input.lower() in ['exit', 'quit', 'goodbye']:
                print("Farewell, ye landlubber! May fair winds find ye!")
                break
            try:
                self.generate_response(user_input)
            except Exception as e:
                print(f"Blimey! We've hit rough seas: {str(e)}")

if __name__ == "__main__":
    bot = PirateBot()
    bot.chat()

The above code consists of three methods:

  • __init__ function
  • generate_response
  • chat

The __init__ method is mostly self-explanatory: it sets up the tokenizer, model, device, and the system-prompt template for our pirate bot.

The generate_response method takes two inputs, user_input and max_new_tokens. The user input is appended to the self.messages list with the role set to user; self.messages tracks the conversation history between the user and the assistant. The TextIteratorStreamer creates a streamer object that handles live streaming of the model's response, allowing us to print the output as it is generated and creating a more natural conversational feel.

Response generation runs the model's generate function on a separate thread, while the streamer outputs the text in real time as the model produces it.

The response is printed piece by piece as it’s generated, simulating a typing effect. A small delay (time.sleep(0.05)) adds a pause between outputs for a more natural feel.

Testing the Chatbot: Exploring Its Knowledge Capabilities

We will now delve into the testing phase of our chatbot, focusing on its knowledge capabilities and responsiveness. By engaging with the bot through various queries, we aim to evaluate its ability to provide accurate and relevant information, highlighting the effectiveness of the underlying Small Language Model (SLM) in delivering meaningful interactions.

Starting the interface of the chatbot:

We will ask Nemo different types of questions to explore its knowledge capabilities.

What is Quantum Teleportation?

Output:


What is Gender Violation?

Output:


Explain the Travelling Salesman Problem (TSP)

The travelling salesman problem (TSP) asks for the shortest route that visits every location in a set exactly once and returns to the start, for example, a courier delivering orders from a restaurant to several addresses. Logistics and delivery services rely on TSP-style optimization to plan efficient multi-stop routes.
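
For reference, here is our own minimal brute-force TSP solver in Python (an illustrative sketch, not the model's answer, which appears in the screenshot below). It checks every permutation, so it is only practical for a handful of cities:

from itertools import permutations

def tsp_brute_force(dist):
    """Return (best_cost, best_route) for a distance matrix, starting and ending at city 0."""
    n = len(dist)
    best_cost, best_route = float("inf"), None
    for perm in permutations(range(1, n)):
        route = (0,) + perm + (0,)
        # Sum edge costs along the route
        cost = sum(dist[route[i]][route[i + 1]] for i in range(n))
        if cost < best_cost:
            best_cost, best_route = cost, route
    return best_cost, best_route

# Example: 4 cities with a symmetric distance matrix
dist = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
print(tsp_brute_force(dist))  # -> (80, (0, 1, 3, 2, 0))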

Output:


Implement the Travelling Salesman Problem in Python

Output:


We can see that the model performs reasonably well on all the questions, which we drew from several different subject areas.

Conclusion

Nemotron Mini 4B is a very capable model for business applications, and a game company already uses it through the Nvidia ACE suite. It is just the start of cutting-edge generative AI applications in the gaming industry that will run directly on the player's computer and enhance the gaming experience. This is only the tip of the iceberg; in the coming days we will explore more ideas around SLMs.

Explore the code behind this article on GitHub!

Key Takeaways

  • SLMs use fewer resources while delivering faster inference, making them suitable for real-time applications.
  • Nemotron-Mini-4B-Instruct is an industry-ready model, already used in games through NVIDIA ACE.
  • The model is fine-tuned from Minitron-4B-Base, which was itself pruned and distilled from Nemotron-4 15B.
  • Nemotron-Mini excels in applications designed for role-playing, answering questions from documents, and function calling.

Frequently Asked Questions

Q1. How are SLMs different from LLMs?

A. SLMs are more resource-efficient than LLMs. They are specifically built for on-device, IoT, and edge deployments.

Q2. Can SLMs be fine-tuned for specific tasks?

A. Yes, you can fine-tune SLMs for specific tasks such as text classification, chatbots, generating bills for healthcare services, customer care, and in-game dialogue and characters.
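
As a rough illustration of how such fine-tuning is often approached, here is our own sketch using the Hugging Face peft library (not covered in this article); the target module names below are assumptions that vary by model architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# LoRA trains small adapter matrices instead of the full weights
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections; check the model's layer names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable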

Q3. Can Nemotron-Mini-4B-Instruct be used from Ollama?

A. Yes, you can use Nemotron-Mini-4B-Instruct directly through Ollama. Just install Ollama and then run ollama run nemotron-mini-4b-instruct. That's all; you can start asking questions directly on the command line.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

A self-taught, project-driven learner who loves to work on complex projects in deep learning, computer vision, and NLP. I always try to gain a deep understanding of a topic, whether in deep learning, machine learning, or physics, and I love creating content from my learning and sharing my understanding with the world.
