NVLM 1.0: NVIDIA’s Innovative Approach to Multimodal LLMs

Badrinarayan M 05 Oct, 2024
9 min read

Introduction

We are going to look into NVLM 1.0, the recently released family of multimodal large language models from NVIDIA. These models achieve state-of-the-art results on vision-language tasks, rivalling the leading proprietary and open-access models (such as Llama 3-V 405B and InternVL 2). NVLM 1.0 even shows improved text-only performance over its LLM backbone after multimodal training. NVLM is open: the model weights and code are being made available to the community.

NVIDIA conducts a thorough model-design comparison between cross-attention-based models (e.g., Flamingo) and decoder-only multimodal LLMs (e.g., LLaVA). Based on the merits and shortcomings of both approaches, they propose a novel architecture that improves both training efficiency and multimodal reasoning ability.

Overview

  • NVIDIA’s NVLM 1.0 is an open-source multimodal LLM family that excels in vision-language and text-only tasks.
  • NVLM 1.0 offers three architectures: decoder-only (NVLM-D), cross-attention (NVLM-X), and a hybrid model (NVLM-H).
  • The models demonstrate superior performance in tasks like OCR, multimodal reasoning, and high-resolution image processing.
  • NVLM 1.0 maintains strong text-only performance, overcoming typical multimodal training issues seen in other models.
  • NVIDIA emphasizes data quality and diversity in both pretraining and supervised fine-tuning for optimal model outcomes.
  • NVLM 1.0 is open-source, with model weights and code accessible to the community for further research and development.

Qualitative Examples of NVLM-1.0-D 72B

Illustration of the powerful scene understanding capabilities of the NVLM-1.0-D 72B model. It has the common sense to identify possible risks or mishaps and accurately recommends what needs to be done right away.

Additional illustrations of the NVLM-1.0-D 72B model’s capacity to comprehend memes, a difficult undertaking that requires a sense of humour and familiarity with significant societal trends, context, or events.

Comparison of NVLM with Other LLMs

NVIDIA compares NVLM 1.0 against popular open-access and proprietary multimodal LLMs. Note that the model weights for Llama 3-V had not been released as of the time of the report. The results show that NVLM 1.0 performs comparably to the top models on both vision-language and text-only tasks. Each multimodal LLM is also compared to its backbone LLM on text-only tasks.

After multimodal training, InternVL2-Llama3-76B’s text performance drastically declines. Llama 3-V 70B and 405B exhibit no degradation in text-only tasks because multimodal training freezes their LLM backbones. However, the NVLM-1.0-D 72B model shows notable improvements over its text backbone on text-only math and coding benchmarks, with average accuracy rising by 4.3 points following multimodal training.

Also Read: Nvidia Introduces VILA: Visual Language Intelligence and Edge AI 2.0

Limitations of Other Multimodal LLMs

The field has advanced open-access multimodal LLMs considerably. Prominent families of open models include LLaVA, Llama 3-V, InternVL, and BLIP. The two most popular architectures for building these multimodal LLMs are the cross-attention-based architecture (e.g., Flamingo and Llama 3-V), which handles image tokens through LLM cross-attention layers, and the decoder-only architecture (e.g., LLaVA and InternVL), which processes image tokens inside the LLM self-attention layers.

  • Inconsistent architecture comparisons: Unlike text-based LLMs, multimodal LLM architectures (e.g., decoder-only vs. cross-attention models) haven’t been compared uniformly, due to differences in model backbones, vision encoders, and training data. This makes direct comparisons challenging. For instance, the open-access IDEFICS-80B (based on LLaMA-65B) is considered inferior to LLaVA-1.5-13B (based on Vicuna-13B) in visual question-answering tasks.
  • Handling high-resolution image input: While models that use dynamic high-resolution images perform well on OCR tasks, they sometimes show reduced accuracy in reasoning tasks compared to low-resolution models.
  • Degradation in text-only performance: Open-access multimodal LLMs show strong performance on vision-language tasks but suffer in text-only tasks, unlike proprietary models like GPT-4. Llama 3-V addresses this by freezing LLM parameters, but these models are not yet publicly available.

Addressing These Limitations

To address these limitations, NVIDIA introduced the NVLM 1.0 family of multimodal LLMs:

  1. NVLM-D: A decoder-only architecture
  2. NVLM-X: A cross-attention-based architecture
  3. NVLM-H: A novel hybrid architecture

All three models are trained on the same curated data blend. The architectures achieve state-of-the-art performance while offering practitioners flexible and feature-rich model options.

  • Model architecture: A comparison between decoder-only and cross-attention models shows that cross-attention-based NVLM-X is more computationally efficient with high-resolution images, while the decoder-only NVLM-D performs better in OCR tasks and reasoning. Based on these insights, a hybrid model, NVLM-H, is proposed, which balances efficiency and reasoning ability.
  • High-resolution image processing: A new tile-tagging design is introduced for handling high-resolution images, improving OCR tasks and multimodal reasoning performance. Ablation studies reveal that adding text-based tags to image tokens enhances accuracy.
  • Training data: The study emphasizes the importance of data quality and diversity over scale in multimodal pretraining and supervised fine-tuning (SFT). Abundant, diverse pretraining data benefits both cross-attention and decoder-only models. Compared to previous works, the team curated a larger, task-oriented dataset for SFT.
  • Production-grade multimodality: To ensure the NVLM models excel in both vision-language and text-only tasks, two strategies are employed: freezing LLM parameters in cross-attention models to maintain text performance, and integrating a high-quality text dataset into multimodal fine-tuning. This approach not only preserves text-only performance but also improves capabilities in math and coding tasks.

Also Read: Top 5 FREE Generative AI Courses by NVIDIA

NVLM: Models and Training Methods

  • Decoder-only (NVLM-D): This model handles multimodal inputs by processing image tokens directly within the language model’s self-attention layers, making it well-suited for unified multimodal reasoning tasks such as OCR and document understanding.
  • Cross-attention-based (NVLM-X): It processes image tokens through cross-attention layers, which makes it computationally efficient, especially when dealing with high-resolution images. This model excels in handling image-heavy tasks and offers higher throughput during training compared to decoder-only models.
  • Hybrid (NVLM-H): This model combines the advantages of both NVLM-D and NVLM-X by processing thumbnail images and text tokens jointly in the LLM’s self-attention layers, while finer image details are handled through cross-attention. It improves both computational efficiency and reasoning capabilities for multimodal tasks. (A conceptual sketch of the decoder-only and cross-attention pathways follows this list.)
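To make the difference between these pathways concrete, here is a minimal conceptual sketch in PyTorch. This is not NVLM’s actual code; the dimensions, module, and variable names are illustrative assumptions based on the descriptions above.

import torch
import torch.nn as nn

d_model, n_img, n_txt = 1024, 256, 32            # illustrative sizes
image_feats = torch.randn(1, n_img, d_model)     # projected vision-encoder outputs
text_embeds = torch.randn(1, n_txt, d_model)     # text token embeddings

# Decoder-only (NVLM-D style): image tokens are concatenated with text tokens
# and the combined sequence flows through the LLM's self-attention layers.
decoder_input = torch.cat([image_feats, text_embeds], dim=1)   # (1, 288, 1024)

# Cross-attention (NVLM-X style): the text sequence stays short and attends to
# image features through added cross-attention layers instead.
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
attended, _ = cross_attn(query=text_embeds, key=image_feats, value=image_feats)

print(decoder_input.shape, attended.shape)   # (1, 288, 1024) and (1, 32, 1024)

In this picture, NVLM-H sends the thumbnail tokens down the decoder-only path while the remaining tile features are consumed through cross-attention, which is how it balances reasoning quality and efficiency.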

All models share a vision encoder (InternViT-6B) and employ a dynamic high-resolution (DHR) approach, which divides high-resolution images into smaller tiles for processing. Image features are integrated into the LLM through text-based tile tags and modality-alignment modules. The training method is split into two phases:

  • Pretraining, where the vision encoder and LLM are frozen and only the modality-alignment modules are trained.
  • Supervised fine-tuning (SFT), which trains both the LLM and the modality-alignment modules.
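As a rough illustration of this two-phase recipe, the freezing scheme can be sketched with standard PyTorch requires_grad toggles. This is a sketch under assumptions, not the authors’ training code; the attribute names vision_model, mlp1, and language_model simply mirror the device-map keys that appear in the inference code later in this article.

# Minimal sketch of the two training phases; `model` stands for a generic
# multimodal wrapper exposing vision_model, mlp1 (modality-alignment MLP),
# and language_model submodules.
def set_pretraining_mode(model):
    # Phase 1: freeze the vision encoder and the LLM; train only the
    # modality-alignment module.
    model.vision_model.requires_grad_(False)
    model.language_model.requires_grad_(False)
    model.mlp1.requires_grad_(True)

def set_sft_mode(model):
    # Phase 2: unfreeze the LLM and keep training the alignment module
    # (the vision encoder stays frozen in this sketch). For NVLM-X, the
    # article notes the LLM backbone can instead be kept frozen to
    # preserve text-only performance.
    model.vision_model.requires_grad_(False)
    model.language_model.requires_grad_(True)
    model.mlp1.requires_grad_(True)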

NVLM-1.0 offers three architectural options: the cross-attention-based NVLM-X (top), the hybrid NVLM-H (middle), and the decoder-only NVLM-D (bottom). The dynamic high-resolution vision pathway is shared by all three models. However, different architectures process the image features from thumbnails and regular local tiles in distinct ways.
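The tile-tagging idea can also be pictured with a short sketch. The tag strings below are assumptions for illustration rather than NVLM’s literal tokens; the point is that every tile’s features are prefixed with a plain-text tag so the LLM can distinguish the global thumbnail from each local tile.

# Hypothetical tile-tagged input layout for a dynamic-high-resolution image.
# Each <image> placeholder stands for the projected features of one tile.
def build_tile_tagged_prompt(num_local_tiles: int, question: str) -> str:
    parts = ['<tile_global_thumbnail><image>']                  # global thumbnail first
    parts += [f'<tile_{k}><image>' for k in range(1, num_local_tiles + 1)]
    return ''.join(parts) + '\n' + question

print(build_tile_tagged_prompt(4, 'Please describe the image.'))
# <tile_global_thumbnail><image><tile_1><image>...<tile_4><image>
# Please describe the image.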

Training Data

The authors provide a detailed breakdown of the curated datasets used for both pretraining and SFT.

  • Pretraining datasets include captioning, visual question answering (VQA), document understanding, and OCR-related data. The study emphasizes the importance of data quality and diversity over sheer scale, noting that noisy datasets hinder the model’s ability to learn effectively.
  • The multimodal pretraining datasets cover a wide range of tasks, from image captioning (COCO, LAION-115M) to document OCR (OCR-VQA, ReCTs) and math reasoning in visual contexts (CLEVR-Math). A notable finding is that diverse task-oriented datasets, such as VQA and OCR, significantly enhance cross-modal alignment and improve final results.
  • During SFT, the model is fine-tuned on a high-quality blend of multimodal datasets to enhance vision-language understanding. The SFT stage incorporates datasets like TextVQA, ChartQA, DocVQA, and AI2D. Text-only fine-tuning datasets are also used to prevent degradation of text-only performance. A special effort is made to ensure that the fine-tuning data includes math and coding tasks, helping the model to improve performance in these areas.

Also Read: What are Multimodal Models?

Results

The NVLM-1.0 family is evaluated across multiple benchmarks, demonstrating competitive or superior performance compared to other leading multimodal and text-only models, both proprietary (e.g., GPT-4o, Claude 3.5) and open-access (e.g., LLaVA, InternVL). Key findings include:

  • NVLM-D outperformed all open-access models on benchmarks such as OCRBench and VQAv2, highlighting its strength in vision-language tasks like scene-text reading and document understanding.
  • NVLM-H showed the highest scores on multimodal reasoning tasks (e.g., MMMU, MathVista) and demonstrated superior computational efficiency. This hybrid model combines the strengths of both decoder-only and cross-attention approaches, achieving state-of-the-art results on vision-language tasks without sacrificing efficiency.
  • NVLM-X demonstrated best-in-class performance among cross-attention-based models, particularly for tasks involving high-resolution images, and had the advantage of faster training and inference speeds.

NVLM models maintained or improved their performance on text-only tasks (like coding and math benchmarks such as MMLU, GSM8K, MATH, and HumanEval) after multimodal training, which is a significant achievement, as other multimodal models typically experience degradation in these areas.

Accessing NVLM-D 72B

We can access the model from Hugging Face using the transformers library. Below is the code to run inference with NVLM-D 72B, taken straight from the documentation. Note that this is a 150+ GB model, so downloading and loading it requires substantial disk space and GPU memory.

1. Import necessary libraries

import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

2. Model Sharding

The split_model() function defines a device map for distributing the layers of the model across multiple GPUs.

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, alignment MLP, embeddings, output head,
    # and the final transformer layer on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

This distribution ensures efficient use of multiple GPUs to handle large models.
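As a quick, illustrative sanity check of the arithmetic (assuming a hypothetical 8-GPU node):

import math
world_size = 8                                   # hypothetical GPU count
per_gpu = math.ceil(80 / (world_size - 0.5))     # 11 decoder layers per full GPU
first_gpu = math.ceil(per_gpu * 0.5)             # 6 layers on GPU 0, which also hosts
                                                 # the vision encoder, alignment MLP,
                                                 # embeddings, and output head
print(per_gpu, first_gpu)                        # 11 6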

3. Image Preprocessing

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
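A quick illustrative check, using a synthetic placeholder image: any PIL image is converted to RGB, resized to the model’s input resolution, and normalized into a (3, input_size, input_size) tensor.

img = Image.new('RGB', (1024, 768))              # dummy image for illustration
tensor = build_transform(input_size=448)(img)
print(tensor.shape)                              # torch.Size([3, 448, 448])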

4. Dynamic Image Tiling

These functions split an image into smaller tiles based on its aspect ratio: find_closest_aspect_ratio() picks the best tile grid, and dynamic_preprocess() performs the resizing and cropping.

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (i x j tiles, with min_num <= i*j <= max_num)
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
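For instance, with a synthetic 2:1 image the closest allowed grid is (2, 1), so the function returns two 448x448 local tiles plus a global thumbnail when use_thumbnail=True (illustrative check):

wide = Image.new('RGB', (896, 448))              # dummy 2:1 image
tiles = dynamic_preprocess(wide, image_size=448, max_num=12, use_thumbnail=True)
print(len(tiles), tiles[0].size)                 # 3 (448, 448)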

5. Loading and Preprocessing Images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
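Putting the pieces together (the file path below is a placeholder), load_image returns a single stacked tensor of shape (num_tiles, 3, 448, 448), where the number of tiles depends on the image’s aspect ratio and max_num:

pixel_values = load_image('path/to/your/example/image.jpg', max_num=12)
print(pixel_values.shape)   # e.g. torch.Size([7, 3, 448, 448]) for a wide image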

6. Loading and Using the Model

path = "nvidia/NVLM-D-72B"

device_map = split_model()

model = AutoModel.from_pretrained(

   path,

   torch_dtype=torch.bfloat16,

   low_cpu_mem_usage=True,

   use_flash_attn=False,

   trust_remote_code=True,

   device_map=device_map).eval()

print(model)

7. Text and Image Conversations

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(torch.bfloat16)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
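The same chat helper can carry a conversation across turns by passing the returned history back in, as the pure-text example above does. Below is a hedged sketch of a multi-round, single-image exchange; the follow-up questions are illustrative and assume the helper accepts history for image inputs just as it does for text.

# multi-round, single-image conversation (illustrative follow-up questions)
question = '<image>\nWhat objects are visible in the image?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Which of those objects could pose a safety risk?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')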

Conclusion

We can highlight that the NVLM-1.0 family achieves state-of-the-art results across a wide range of vision-language and text-only tasks, maintaining production-grade multimodality. This means the models perform well in both multimodal and text-only settings, without significant degradation in text-only performance—a common issue in many other multimodal models. The authors also emphasize the importance of high-quality training data and diverse task-oriented datasets for boosting model performance.

The NVLM-1.0 family demonstrates that it is possible to create multimodal LLMs that excel in a wide variety of tasks, including reasoning, coding, and math. In their commitment to furthering research, the team plans to release the model weights and open-source the code, inviting the community to build upon their work.

Are you looking for an online Generative AI course? If yes, explore this: GenAI Pinnacle Program.

Frequently Asked Questions

Q1. What is NVLM 1.0?

Ans. NVLM 1.0 is a family of open-source, multimodal large language models by NVIDIA. It excels in both vision-language tasks and text-only tasks, rivaling leading proprietary and open-access models.

Q2. What are the key architectures in NVLM 1.0?

Ans. NVLM 1.0 includes three model architectures:

NVLM-D: A decoder-only model for unified multimodal reasoning tasks like OCR and document understanding.
NVLM-X: A cross-attention-based model for efficient high-resolution image processing.
NVLM-H: A hybrid model that balances efficiency and reasoning by combining elements of both NVLM-D and NVLM-X.

Q3. What makes NVLM 1.0 unique?

Ans. NVLM 1.0 is trained in two phases:
Pretraining: The vision encoder and LLM are frozen, and only modality-alignment layers are trained.
Supervised Fine-Tuning (SFT): Both the LLM and modality-alignment layers are fine-tuned on a curated set of multimodal tasks, ensuring strong performance on vision-language and text-only tasks.

Q4. What datasets are used to train NVLM 1.0?

Ans. NVLM 1.0 uses high-quality, diverse datasets for pretraining and fine-tuning, including COCO, OCR-VQA, ChartQA, DocVQA, and AI2D. Special attention is given to maintaining data quality and diversity.

Badrinarayan M 05 Oct, 2024

Data science Trainee at Analytics Vidhya, specializing in ML, DL and Gen AI. Dedicated to sharing insights through articles on these subjects. Eager to learn and contribute to the field's advancements. Passionate about leveraging data to solve complex problems and drive innovation.