
How to Work with Nvidia Nemotron-Mini-4B-Instruct?

avizyt 26 Sep, 2024
8 min read

Introduction

Nvidia has launched its latest Small Language Model (SLM), Nemotron-Mini-4B-Instruct. An SLM is a distilled, quantized, fine-tuned version of a larger base model, developed primarily for speed and on-device deployment. Nemotron-Mini-4B is a fine-tuned version of Nvidia's Minitron-4B-Base, which is itself a pruned and distilled version of Nemotron-4 15B. This instruct model is optimized for roleplay, RAG QA, and function calling in English. Trained between February 2024 and August 2024, it incorporates recent events and developments worldwide.

This article explores Nvidia’s Nemotron-Mini-4B-Instruct, a Small Language Model (SLM). We will discuss its evolution from the larger Nemotron-4 15B model, focusing on how it was distilled and fine-tuned for speed and on-device deployment. We also highlight its training period from February to August 2024, which allows it to incorporate recent global developments and makes it a powerful tool for real-time AI applications.

Learning Outcomes

  • Understand the architecture and optimization techniques behind Small Language Models (SLMs) like Nvidia’s Nemotron-Mini-4B-Instruct.
  • Learn how to set up a development environment for implementing SLMs using Conda and install essential libraries.
  • Gain hands-on experience in coding a chatbot that utilizes the Nemotron-Mini-4B-Instruct model for interactive conversations.
  • Explore real-world applications of SLMs in gaming and other industries, highlighting their advantages over larger models.
  • Discover the difference between SLMs and LLMs, including their resource efficiency and adaptability for specific tasks.

This article was published as a part of the Data Science Blogathon.

What are Small Language Models (SLMs)?

Small language models (SLMs) serve as compact versions of large language models, designed to perform NLP tasks while using reduced computational resources. They optimize efficiency and speed, often delivering good performance on specific tasks with fewer parameters. These features make them ideal for edge devices or on-device computing with limited memory and processing power. These models are less powerful than LLMs overall, but they can do a better job on domain-focused tasks.

Training Techniques for Small Language Models

Typically, developers train or fine-tune small language models (SLMs) from large language models (LLMs) using various techniques that reduce the model’s size while maintaining a reasonable level of performance.

  • Knowledge Distillation: The LLM trains the smaller model, acting as a teacher while the SLM plays the role of a student. The small model learns to mimic the teacher’s outputs, capturing the essential knowledge while reducing complexity.
  • Parameter Pruning: The training process removes redundant or less important parameters from the LLM, reducing the model size without drastically affecting performance.
  • Quantization: Model weights are converted from higher precision formats, such as 32-bit, to lower precision formats like 8-bit or 4-bit, which reduces memory usage and speeds up computations.
  • Task-Specific Fine-Tuning: A pre-trained LLM undergoes fine-tuning on a specific task using a smaller dataset, optimizing the smaller model for targeted tasks like roleplaying and QA chat.

These are some of the cutting-edge techniques used to build SLMs; the sketch below makes the distillation step concrete.
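To illustrate, here is a minimal PyTorch sketch of the standard teacher-student distillation objective. This is our own illustrative formulation, not Nvidia's actual training code; the function name and the temperature value are assumptions for the example.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature so the student learns
    # from the teacher's full output distribution, not just its top answer.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher; the T^2 factor keeps
    # gradient magnitudes comparable across temperature settings.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

In practice this term is usually mixed with the ordinary cross-entropy loss on the ground-truth labels.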

Significance of SLMs in Today’s AI Landscape

Small Language Models (SLMs) play a crucial role in the current AI landscape due to their efficiency, scalability, and accessibility. Here are some important reasons:

  • Resource Efficiency: SLMs require significantly less computational power, memory, and storage, making them ideal for on-device and mobile applications.
  • Faster Inference: Their smaller size allows for quicker inference times, which is essential for real-time applications like chatbots, voice assistants, and IoT devices.
  • Cost-Effective: Training and deploying large language models can be expensive; SLMs offer a more cost-effective solution for businesses and developers, democratizing AI access.
  • Adaptability: Due to their size, users can fine-tune SLMs more easily for specific tasks or niche applications, enabling greater adaptability across a wide range of industries, including healthcare and retail.

Real-World Applications of Nemotron-Mini-4B

At Gamescom 2024, NVIDIA announced the first on-device SLM for improving the conversational abilities of game characters. The game Mecha BREAK by Amazing Seasun Games utilizes the NVIDIA ACE suite, a set of digital human technologies that provide speech, intelligence, and animation powered by generative AI.


Setting Up Your Development Environment

Creating a robust development environment is essential for the successful development of your chatbot. This step involves configuring the necessary tools, libraries, and frameworks that will enable you to write, test, and refine your code efficiently.

Step 1: Create a Conda Environment

First, create an Anaconda environment. Run the command below in your terminal.

# Create conda env
$ conda create -n nemotron python=3.11

It will create a Python 3.11 environment named nemotron.

Step 2: Activating the Development Environment

Setting up a development environment is a crucial step in building your chatbot, as it provides the necessary tools and frameworks for coding and testing. We’ll walk you through the process of activating your development environment, ensuring you have everything you need to bring your chatbot to life seamlessly.

# Create a dev folder and activate the anaconda env
$ mkdir nemotron-dev
$ cd nemotron-dev
# Activating the nemotron conda env
$ conda activate nemotron

Step 3: Installing Essential Libraries

First, install PyTorch according to your OS to set up your development environment. Then, install Transformers and LangChain using pip.

# Install PyTorch (Windows) for GPU
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install PyTorch (Windows) CPU
pip install torch torchvision torchaudio

Next, install Transformers and LangChain.

# Install transformers and langchain
pip install transformers langchain
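To confirm the installation, you can run a quick sanity check (a minimal sketch on our part; the versions printed will depend on your setup):

# Verify PyTorch and Transformers are importable and check for a GPU
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())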

Code Implementation for a Simple Chatbot

Have you ever wondered how to create a chatbot that can hold a conversation? In this section, we will guide you through the code implementation of a simple chatbot. You’ll learn about the key components, programming languages, and libraries involved in building a functional conversational agent, enabling you to design an engaging and interactive user experience.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are friendly chatbot, reply on style of a Professor",
    },
    {"role": "user", "content": "What is Quantum Entanglement?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

Here, we download Nemotron-Mini-4B-Instruct (Nemo) from the Hugging Face Hub via transformers, loading the model with AutoModelForCausalLM and the tokenizer with AutoTokenizer.
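One optional refinement (our own sketch, not part of the original listing): tokenizer.decode(outputs[0]) prints the prompt along with the reply, because generate returns the full token sequence. You can append the following to the listing above to slice off the prompt tokens and show only the model's answer:

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True)
print(response)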

Creating Message Template

We create a message template for a professor-style chatbot and ask the question “What is Quantum Entanglement?”

Let’s see how Nemo answers the question.


Wow, it answered pretty well. We will now create a more user-friendly chatbot that we can chat with continuously.

Building an Advanced User-Friendly Chatbot

We will explore the process of building an advanced user-friendly chatbot that not only meets the needs of users but also enhances their interaction experience. We’ll discuss the essential components, design principles, and technologies involved in creating a chatbot that is intuitive, responsive, and capable of understanding user intent, ultimately bridging the gap between technology and user satisfaction.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import time

class PirateBot:
    def __init__(self, model_name="nvidia/Nemotron-Mini-4B-Instruct"):
        print("Ahoy! Yer pirate bot be loadin' the model. Stand by, ye scurvy dog!")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        
        # Move model to GPU if available
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        
        print(f"Arrr! The model be ready on {self.device}!")
        
        self.messages = [
            {
                "role": "system",
                "content": "You are a friendly chatbot who always responds in the style of a pirate",
            }
        ]

    def generate_response(self, user_input, max_new_tokens=1024):
        self.messages.append({"role": "user", "content": user_input})
        
        tokenized_chat = self.tokenizer.apply_chat_template(
            self.messages, 
            tokenize=True, 
            add_generation_prompt=True, 
            return_tensors="pt"
        ).to(self.device)

        streamer = TextIteratorStreamer(self.tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
        
        generation_kwargs = dict(
            inputs=tokenized_chat,
            max_new_tokens=max_new_tokens,
            streamer=streamer,
            do_sample=True,
            top_p=0.95,
            top_k=50,
            temperature=0.7,
            num_beams=1,
        )

        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        print("Pirate's response: ", end="", flush=True)
        generated_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            generated_text += new_text
            time.sleep(0.05)  # Add a small delay for a more natural feel
        print("\n")

        self.messages.append({"role": "assistant", "content": generated_text.strip()})
        return generated_text.strip()

    def chat(self):
        print("Ahoy, matey! I be yer pirate chatbot. What treasure of knowledge ye be seekin'?")
        while True:
            user_input = input("You: ")
            if user_input.lower() in ['exit', 'quit', 'goodbye']:
                print("Farewell, ye landlubber! May fair winds find ye!")
                break
            try:
                self.generate_response(user_input)
            except Exception as e:
                print(f"Blimey! We've hit rough seas: {str(e)}")

if __name__ == "__main__":
    bot = PirateBot()
    bot.chat()

The above code consists of three methods:

  • __init__
  • generate_response
  • chat

The __init__ method is mostly self-explanatory: it loads the tokenizer and model, selects the device, and sets up the system prompt for our Pirate Bot.

The generate_response method takes two inputs, user_input and max_new_tokens. The user input is appended to self.messages with the role “user”; this list tracks the conversation history between the user and the assistant. The TextIteratorStreamer creates a streamer object that handles live streaming of the model’s response, allowing us to print the output as it is generated and creating a more natural conversation feel.

Response generation runs the model’s generate function on a separate thread, while the streamer outputs the text in real time as the model produces it.

The response is printed piece by piece as it’s generated, simulating a typing effect. A small delay (time.sleep(0.05)) adds a pause between outputs for a more natural feel.

Testing the Chatbot: Exploring Its Knowledge Capabilities

We will now delve into the testing phase of our chatbot, focusing on its knowledge capabilities and responsiveness. By engaging with the bot through various queries, we aim to evaluate its ability to provide accurate and relevant information, highlighting the effectiveness of the underlying Small Language Model (SLM) in delivering meaningful interactions.

Starting the interface of this chatbot:


We will ask Nemo different types of questions to explore its knowledge capabilities.

What is Quantum Teleportation?

Output:


What is Gender Violation?

Output:


Explain the Travelling Salesman (TSP) algorithm

The travelling salesman problem asks for the shortest route that visits every location exactly once and returns to the starting point, for example a driver delivering orders from a restaurant to several customers. Map and logistics services use TSP-style optimizations to plan efficient multi-stop routes.

Output:


Implement the Travelling Salesman Problem in Python

Output:

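For reference, here is a minimal brute-force TSP solver you could compare the model’s generated code against. This is our own illustrative sketch, not the output Nemo produced; it is exact but only practical for small inputs, since it enumerates every possible tour.

from itertools import permutations

def tsp_brute_force(dist):
    # dist[i][j] is the distance from city i to city j.
    # Try every tour that starts and ends at city 0 and keep the cheapest.
    n = len(dist)
    best_tour, best_cost = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost

# Example: a restaurant (city 0) and three delivery stops
distances = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
print(tsp_brute_force(distances))  # ((0, 1, 3, 2, 0), 80)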

We can see that the model performed well on all the questions, which we drew from several different subject areas.

Conclusion

Nemotron Mini 4B is a very capable model for business applications, and it is already used by a game company through the NVIDIA ACE suite. It is just the start of cutting-edge generative AI applications in the gaming industry, running directly on the player’s computer to enhance the gaming experience. This is only the tip of the iceberg; in the coming days, we will explore more ideas around SLMs.

Key Takeaways

  • SLMs use fewer resources while delivering faster inference, making them suitable for real-time applications.
  • Nemotron-Mini-4B-Instruct is an industry-ready model, already used in games through NVIDIA ACE.
  • The model is fine-tuned from Minitron-4B-Base, itself a pruned and distilled version of Nemotron-4 15B.
  • Nemotron-Mini excels in applications designed for role-playing, answering questions from documents, and function calling.

Frequently Asked Questions

Q1. How are SLMs different from LLMs?

A. SLMs are more resource-efficient than LLMs. They are specifically built for on-device, IoT, and edge deployments.

Q2. Can SLMs be fine-tuned for specific tasks?

A. Yes, you can fine-tune SLMs for specific tasks such as text classification, chatbots, generating bills for healthcare services, customer care, and in-game dialogue and characters.
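As an illustration, here is a minimal sketch of task-specific fine-tuning with LoRA via the peft library. The target module names below are assumptions that depend on the model’s architecture, so inspect the model’s layer names before training.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# LoRA trains small low-rank adapter matrices instead of the full weights.
# target_modules is an assumption: verify the attention projection names
# for this architecture before running.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable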

Q3. Can Nemotron-Mini-4B-Instruct be used from Ollama?

A. Yes, you can start using Nemotron-Mini-4B-Instruct directly through Ollama. Just install Ollama and then type ollama run nemotron-mini-4b-instruct. That’s all; you can then start asking questions directly on the command line.
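For example (the model tag is taken from the answer above; check the Ollama model library for the exact tag before running):

# Pull and run the model locally with Ollama
$ ollama run nemotron-mini-4b-instruct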

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

avizyt 26 Sep, 2024

A self-taught, project-driven learner who loves to work on complex projects in deep learning, computer vision, and NLP. I always try to gain a deep understanding of a topic, whether in deep learning, machine learning, or physics, and I love creating content about what I learn and sharing my understanding with the world.
