In this article, we look into NVLM 1.0, the family of multimodal large language models recently released by NVIDIA. These models achieve state-of-the-art results on vision-language tasks, rivalling leading proprietary models and open-access models such as Llama 3-V 405B and InternVL 2. NVLM 1.0 also shows improved text-only performance over its LLM backbone after multimodal training. NVLM is open-source: the model weights and code are available to the community.
NVIDIA conducts a thorough design comparison between cross-attention-based models (e.g., Flamingo) and decoder-only multimodal LLMs (e.g., LLaVA). Based on the merits and shortcomings of both approaches, they propose a novel architecture that improves both training efficiency and multimodal reasoning capabilities.
The NVLM-1.0-D 72B model shows powerful scene-understanding capabilities: it has the common sense to identify possible risks or mishaps and accurately recommends what needs to be done right away.
It can also comprehend memes, a difficult task that requires a sense of humour and familiarity with significant societal trends, context, or events.
When comparing popular open-access and proprietary multimodal LLMs with NVLM 1.0 (note that the Llama 3-V model weights had not been released at the time of the report), the results show that NVLM 1.0 performs comparably to the top models on both vision-language and text-only tasks. Each multimodal LLM is also compared to its backbone LLM on text-only tasks.
After multimodal training, InternVL2-Llama3-76B's text performance declines drastically. Llama 3-V 70B and 405B exhibit no degradation on text-only tasks because their LLM backbones are frozen during multimodal training. The NVLM-1.0-D 72B model, in contrast, shows notable improvements over its text backbone on text-only math and coding benchmarks, with average accuracy rising by 4.3 points after multimodal training.
Open-access multimodal LLMs have advanced considerably; prominent families of open models include LLaVA, Llama 3-V, InternVL, and BLIP. The two most popular architectures for building these multimodal LLMs are the cross-attention-based architecture (e.g., Flamingo and Llama 3-V), which feeds image tokens to the LLM through cross-attention layers, and the decoder-only architecture (e.g., LLaVA and InternVL), which processes image tokens inside the LLM's self-attention layers.
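To make the distinction concrete, here is a minimal, hypothetical PyTorch sketch of the two designs (illustrative only, not NVLM's actual code; the dimensions, the linear projector, and the single cross-attention layer are made-up stand-ins):

import torch
import torch.nn as nn

# Toy dimensions for illustration only.
d_model = 64
image_feats = torch.randn(1, 256, d_model)  # 256 image tokens from a vision encoder
text_embs = torch.randn(1, 32, d_model)     # 32 text token embeddings from the LLM

# Decoder-only style (LLaVA, InternVL, NVLM-D): project the image tokens and
# concatenate them with the text tokens, then run ordinary self-attention.
projector = nn.Linear(d_model, d_model)
decoder_input = torch.cat([projector(image_feats), text_embs], dim=1)

# Cross-attention style (Flamingo, Llama 3-V, NVLM-X): text tokens attend to the
# image features through dedicated cross-attention layers; the image tokens never
# enter the decoder's self-attention sequence.
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=text_embs, key=image_feats, value=image_feats)

print(decoder_input.shape)  # torch.Size([1, 288, 64])
print(fused.shape)          # torch.Size([1, 32, 64])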
To address the limitations of both approaches, NVIDIA introduced the NVLM 1.0 family of multimodal LLMs, which comes in three variants: the decoder-only NVLM-D, the cross-attention-based NVLM-X, and the hybrid NVLM-H.
All three models are trained on the same curated data blend. The architectures achieve state-of-the-art performance while offering practitioners flexible and feature-rich model options.
All models share the same vision encoder (InternViT-6B) and employ a dynamic high-resolution (DHR) approach, which divides high-resolution images into smaller tiles for processing. The models handle the resulting tiles through text-based tile tags and modality-alignment modules. Training is split into two phases: pretraining, in which the vision encoder and LLM are frozen and only the modality-alignment layers are trained, and supervised fine-tuning (SFT), in which both the LLM and the modality-alignment layers are fine-tuned on a curated blend of multimodal and text-only tasks.
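Below is a minimal, hypothetical PyTorch sketch of that freezing schedule (not NVLM's training code; vision_encoder, projector, and llm are assumed stand-ins for the actual modules):

import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Toggle gradient updates for every parameter in the module.
    for p in module.parameters():
        p.requires_grad_(trainable)

def configure_phase(phase: str, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module) -> None:
    if phase == "pretraining":
        # Phase 1: freeze the vision encoder and the LLM; train only the
        # modality-alignment module (the projector).
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    elif phase == "sft":
        # Phase 2: fine-tune the LLM together with the alignment module;
        # the vision encoder stays frozen.
        set_trainable(vision_encoder, False)
        set_trainable(llm, True)
        set_trainable(projector, True)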
NVLM-1.0 offers three architectural options: the cross-attention-based NVLM-X (top), the hybrid NVLM-H (middle), and the decoder-only NVLM-D (bottom). The dynamic high-resolution vision pathway is shared by all three models. However, different architectures process the image features from thumbnails and regular local tiles in distinct ways.
The authors provide a detailed breakdown of the curated datasets used for both pretraining and SFT.
The NVLM-1.0 family is evaluated across multiple benchmarks, demonstrating competitive or superior performance compared to other leading multimodal and text-only models, both proprietary (e.g., GPT-4o, Claude 3.5) and open-access (e.g., LLaVA, InternVL). Key findings include:
NVLM models maintained or improved their performance on text-only benchmarks (such as MMLU, GSM8K, MATH, and HumanEval) after multimodal training. This is a significant achievement, as other multimodal models typically experience degradation in these areas.
We can access the model from the Hugging Face Hub using the transformers library. Below is the code to run inference with the NVLM-D 72B model, taken straight from the documentation. Note that this is a 150+ GB model.
import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
The split_model() function defines a device map for distributing the model's layers across multiple GPUs.
def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
This distribution ensures efficient use of multiple GPUs to handle large models.
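As a quick illustration, you can inspect where a few blocks land (this assumes at least one GPU is visible, since the mapping depends on torch.cuda.device_count()):

device_map = split_model()
# Show which GPU a few representative blocks are assigned to.
for name in ['vision_model', 'mlp1', 'language_model.model.layers.0', 'language_model.model.layers.79']:
    print(name, '->', device_map[name])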
The build_transform() function builds the tile-level preprocessing: it converts each tile to RGB, resizes it, and normalizes it with ImageNet statistics.

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
The find_closest_aspect_ratio() function picks the tiling grid whose aspect ratio best matches the input image; dynamic_preprocess() then resizes the image to that grid and splits it into tiles.
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
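As a quick illustrative check (not from the model card), a 1280x720 image with 448-pixel tiles and at most 12 tiles maps to a 4x2 grid:

# Candidate grids (i, j) with at most 12 tiles, sorted by tile count.
ratios = sorted(
    {(i, j) for n in range(1, 13) for i in range(1, n + 1) for j in range(1, n + 1) if i * j <= 12},
    key=lambda x: x[0] * x[1])
print(find_closest_aspect_ratio(1280 / 720, ratios, 1280, 720, 448))  # (4, 2)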
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
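Continuing the illustrative example, a 1280x720 image yields eight 448x448 tiles plus a thumbnail:

demo = Image.new('RGB', (1280, 720))
tiles = dynamic_preprocess(demo, image_size=448, use_thumbnail=True, max_num=12)
print(len(tiles), tiles[0].size)  # 9 (448, 448)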
The load_image() function ties everything together: it opens the image, tiles it with dynamic_preprocess() (including a thumbnail), applies the transform to each tile, and stacks the results into a single tensor.

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
print(model)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# single-image single-round conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(
torch.bfloat16)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
We can highlight that the NVLM-1.0 family achieves state-of-the-art results across a wide range of vision-language and text-only tasks, delivering what the authors call production-grade multimodality. This means the models perform well in both multimodal and text-only settings, without the significant degradation in text-only performance that affects many other multimodal models. The authors also emphasize the importance of high-quality training data and diverse task-oriented datasets for boosting model performance.
The NVLM-1.0 family demonstrates that it is possible to create multimodal LLMs that excel in a wide variety of tasks, including reasoning, coding, and math. In their commitment to furthering research, the team plans to release the model weights and open-source the code, inviting the community to build upon their work.
Hope you like the article! NVIDIA has released NVLM 1.0, an open-source multimodal large language model family that pushes AI capabilities forward. The model weights can be downloaded from Hugging Face, the accompanying documentation covers installation and usage, and the NVLM 1.0 paper details the architecture and performance improvements.
Q. What is NVLM 1.0?
Ans. NVLM 1.0 is a family of open-source, multimodal large language models by NVIDIA. It excels in both vision-language tasks and text-only tasks, rivaling leading proprietary and open-access models.
Q. What model architectures does NVLM 1.0 include?
Ans. NVLM 1.0 includes three model architectures:
– NVLM-D: A decoder-only model for unified multimodal reasoning tasks like OCR and document understanding.
– NVLM-X: A cross-attention-based model for efficient high-resolution image processing.
– NVLM-H: A hybrid model that balances efficiency and reasoning by combining elements of both NVLM-D and NVLM-X.
Q. How is NVLM 1.0 trained?
Ans. NVLM 1.0 is trained in two phases:
Pretraining: The vision encoder and LLM are frozen, and only modality-alignment layers are trained.
Supervised Fine-Tuning (SFT): Both the LLM and modality-alignment layers are fine-tuned on a curated set of multimodal tasks, ensuring strong performance on vision-language and text-only tasks.
Q. Does NVIDIA have a framework for building LLMs?
Ans. Yes, NVIDIA has an LLM framework called NeMo. It helps developers create and train custom language models.