The most powerful VLMs available today remain proprietary, limiting open research exploration. Open models often lag behind because they depend on synthetic data generated by proprietary models, which undercuts true openness. Molmo, a sophisticated vision-language model, seeks to bridge this gap by building high-quality multimodal capabilities from open datasets and independent training methods.
PixMo, the accompanying dataset, was designed to overcome the traditional limitations of data accessibility in VLM development. The team collected extensive image-caption pairs using human speech annotations, which resulted in high-density captions free from the constraints of synthetic datasets.
Molmo’s architecture follows a standard multimodal design: it combines a vision encoder and a language model to create a vision-language model capable of processing both images and text.
The input to Molmo is generated by applying multi-scale and multi-crop transformations to the original image. In multi-crop training, multiple crops (sections) of the same image are taken from different regions, often at various scales and resolutions. Each crop provides a different perspective or focus area of the image.
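To make this concrete, here is a minimal sketch of how multi-scale, multi-crop inputs can be produced with Pillow. The grid sizes are illustrative assumptions; Molmo’s actual cropping logic lives inside its Hugging Face processor.

from PIL import Image

def multi_scale_crops(image: Image.Image, grids=(1, 2, 3)):
    # Illustrative only: one global view (1x1 grid) plus finer grids of tiles.
    # Molmo's real multi-crop preprocessing is handled by its processor.
    width, height = image.size
    crops = []
    for grid in grids:
        tile_w, tile_h = width // grid, height // grid
        for row in range(grid):
            for col in range(grid):
                left, top = col * tile_w, row * tile_h
                crops.append(image.crop((left, top, left + tile_w, top + tile_h)))
    return crops

For grids=(1, 2, 3), this yields one global view plus 4 and 9 progressively finer tiles, and each tile is then passed through the vision encoder.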
The core of Molmo’s visual processing is OpenAI’s CLIP (Contrastive Language Image-Pretraining) model, a powerful Vision Transformer (ViT) optimized for high-resolution inputs.
The connector is a carefully constructed MLP that projects the high-dimensional tokens from CLIP to match the input space (dimensions) the language model requires. Following this projection, a pooling layer performs dimensionality reduction, ensuring the visual tokens are condensed to a manageable size for the language model without sacrificing key visual details.
Dimensionality Reduction Through Pooling: Pooling selects and averages key features across the visual tokens. Conceptually, this can be thought of as a summary of visual information—just enough detail to inform the language model without overwhelming it.
Example: Imagine a cityscape image divided into 100 tokens by the vision encoder. Pooling condenses these tokens by summarizing key features, prioritizing prominent structures (like buildings), and reducing redundancy in repetitive areas (like the sky). This results in a smaller, focused set of around 20 tokens, capturing only the most essential details for efficient processing by the language model.
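As a rough sketch of the connector idea (the dimensions and pooling factor below are made-up illustrations, not Molmo’s actual values):

import torch
import torch.nn as nn

# Illustrative dimensions only; Molmo's real connector is defined in its HF repo.
vit_dim, llm_dim, num_tokens = 1024, 2048, 100

visual_tokens = torch.randn(1, num_tokens, vit_dim)   # output of the vision encoder

# MLP connector: project ViT features into the LLM's embedding space
connector = nn.Sequential(
    nn.Linear(vit_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
projected = connector(visual_tokens)                  # shape: (1, 100, 2048)

# Pooling: average groups of 5 neighbouring tokens, so 100 tokens become 20
pooled = projected.view(1, num_tokens // 5, 5, llm_dim).mean(dim=2)
print(pooled.shape)                                   # torch.Size([1, 20, 2048])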
Molmo’s vision encoder remains consistent across variants, employing CLIP’s ViT-L/14 model for all versions. However, Molmo’s LLM component varies based on requirements for capacity, openness, and compute efficiency: MolmoE-1B builds on the fully open mixture-of-experts OLMoE-1B-7B, Molmo-7B-O on OLMo-7B, Molmo-7B-D on Qwen2 7B, and Molmo-72B on Qwen2 72B.
In transformers, decoder-only architecture is particularly suited for tasks requiring context-based generation, such as captioning or question-answering. The model “decodes” tokens in a self-referential manner, with each token attending to all previous tokens to build a coherent output, guided by both visual and textual cues from previous stages.
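A tiny, generic sketch of what “each token attends to all previous tokens” means in practice (a standard causal mask, not Molmo-specific code):

import torch

seq_len = 5  # e.g. a short run of prompt tokens followed by generated tokens

# Lower-triangular mask: position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).int()
print(causal_mask)
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)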
Molmo’s training is divided into two major stages that contribute to the model’s high performance and versatility:
Goal: Train the model to generate detailed, accurate captions for images. The PixMo-Cap dataset is used in this step.
Molmo uses a simpler, single-stage pre-training method for caption generation, which avoids the complexity and potential inefficiencies of multi-stage pre-training (e.g., freezing parts of the model/network at different stages).
Molmo’s simpler, single-stage pre-training works well in its context because it starts from strong off-the-shelf components (the CLIP vision encoder and a pre-trained LLM) and relies on the high-quality, dense captions of PixMo-Cap, which provide a strong enough training signal to update all parameters jointly without a staged curriculum.
After pre-training for caption generation, Molmo is fine-tuned on a mixture of datasets, including standard academic datasets and additional PixMo datasets like PixMo-AskModelAnything, PixMo-Points, PixMo-Clocks, and PixMo-Docs. The fine-tuning includes supervised training data for tasks like question answering, counting, and point-based referencing.
Evaluating multimodal models can be challenging due to the complexity of visual and linguistic tasks. The Molmo team gauged performance using a combination of academic benchmarks and extensive human evaluations.
Now that we are clear on Molmo’s architecture, let’s get hands-on and try out some examples. In this section, we’ll walk through using Molmo on example images to extract structured information. This hands-on session will help you understand how to load the model, process images, generate outputs, and customize it for your own data.
Colab notebook: Molmo-VLM-handson.ipynb (I used an A100 High-RAM GPU to run these experiments)
First, we need to install some essential packages. These include transformers for model processing, torch for handling tensors, Pillow for image manipulation, and pytesseract for OCR (Optical Character Recognition).
!pip install -q transformers torch Pillow einops
!pip install -q pytesseract
!apt-get install -y tesseract-ocr
Here, we specify the Molmo model we want to use (in this case, MolmoE-1B-0924) and load it along with its processor.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import torch
model_name = 'allenai/MolmoE-1B-0924'
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True, torch_dtype='auto', device_map='auto')
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype='auto', device_map='auto')
model.to("cuda")  # optional: with device_map='auto', the model is already placed on the GPU when one is available
AutoProcessor prepares the inputs for Molmo, handling both images and text prompts, while AutoModelForCausalLM loads the language model. Setting device_map='auto' ensures the model is loaded onto the best available device (such as a GPU) for faster performance.
To work with an image, we load it using Pillow and display it to confirm we have the correct input.
image_path = 'your_image.png' # provide the image path here
image = Image.open(image_path).convert('RGB')
image
This code loads an image from the specified path and converts it to RGB format, ensuring compatibility with the model.
If an image is too large, you can resize it for consistent processing and then display it. The function below resizes any image whose height exceeds 800 pixels. Reducing image size can speed up processing without significantly affecting the model’s ability to interpret the content.
def resize_image(image, max_height=800):
    width, height = image.size
    if height > max_height:
        ratio = max_height / height
        new_width = int(width * ratio)
        new_height = int(height * ratio)
        return image.resize((new_width, new_height))
    return image
We define a text prompt and process both the image and text together using the processor.
inputs = processor.process(
    images=[image],
    text="Extract all the information from the page in JSON format, especially the account summary and all contact details in proper format."
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
The processor combines the image and text into a format the model can interpret. Each input is moved to the model’s device (usually GPU) and reshaped for batch processing.
Using the model’s generate_from_batch function, we generate an output based on the image and prompt.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
Here, we set a maximum limit of 500 tokens for the response (you can increase or decrease this according to your use case) and define a stop condition (<|endoftext|>). The line output[0, inputs['input_ids'].size(1):] slices off the input prompt tokens, keeping only the newly generated tokens and avoiding redundancy in the response.
The model processes the inputs and generates tokens representing the text output, which we then decode to human-readable text. This allows us to see Molmo’s extracted information based on our prompt.
def generate_text(image_path, prompt, max_tokens=500):
    image = Image.open(image_path).convert('RGB')
    inputs = processor.process(
        images=[image],
        text=prompt
    )
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=max_tokens, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
    generated_tokens = output[0, inputs['input_ids'].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return image, generated_text
You can pass custom prompts to refine the model’s focus. In this case, we’re asking for detailed information, specifying a JSON format for structured data extraction. This helps Molmo return data that’s ready for further processing or analysis.
The image from which we are extracting data:
input_path = '/content/Visualization - Binary Quantization.png'
prompt = '''You are an expert mathematician. You need to understand what is mentioned on this page and outline the topics along with explanations.
The output should be in json format with keys "topics mentioned", "explanation": {"exp_topic1", "exp_topic2", ...}
'''
image, generated_text = generate_text(input_path, prompt)
resize_image(image)
print(generated_text)
Output:
{
"topics mentioned": [
"Query and token",
"Binary quantization",
"Hamming distance",
"Minimum Hamming distance",
"Query and token embeddings",
"Final hamming similarity"
],
"explanation": {
"query and token": "The image discusses the process of converting each
value in a query or token into either 1 or 0, depending on whether it
represents a positive or negative value respectively. This technique is used
in binary quantization.",
"binary quantization": "This is a method for representing real numbers in
binary format with a fixed number of bits. The image explains how to convert
floating-point numbers to binary and then calculate the Hamming distance
between two binary vectors.",
"Hamming distance": "This is a measure of how many bit positions differ
between two binary vectors. The image shows how to calculate this distance
between two binary vectors of different lengths.",
"minimum Hamming distance": "This refers to the shortest distance between
two vectors of the same length, excluding the vector itself. The image
provides formulas for calculating this distance for different token sizes
and query lengths.",
"query and token embeddings": "The image describes how to represent query
and token data in a 4-dimensional space using multi-vector embeddings. It
explains the process of tokenization and the use of binary quantization for
this representation.",
"final hamming similarity": "The image concludes by discussing the
calculation of overall hamming similarity between two query vectors and
their embeddings"
}
}
We can also take a complex example where there are many tables and see how much data the model can extract in one go:
input_path = '/content/0fa82bab-e131-43dd-86da-7153b2ecc76d.png'
prompt = '''Extract all the information from the page in json, each and every data needs to be present. Don't miss out on contact details, name, address, account bill summary, billing history and ways to pay.
The output should be in json format with keys being all the data found in the page. Information is crucial.
'''
image, generated_text = generate_text(input_path, prompt, max_tokens=1000)
print(generated_text)
resize_image(image, max_height=600) # display the image by resizing it to 600 pixels in height
Output:
{
"energyStatement": {
"accountNumber": "5553220335-0",
"statementDate": "01/30/2024",
"dueDate": "02/20/2024",
"website": "www.pge.com/myenergy",
"serviceInfo": {
"meterNumber": "10098180854",
"totalUsage": "518.53 MWh",
" rotatingOutageBlock": "10F",
"serviceID": "5534591016"
},
"billingHistory": {
"billingcycles": "33 billing cycles",
"billingcyclesToDate": "12/31/2023",
"currentBillingcycle": "12/22/2023"
},
"serviceSchedule": {
"serviceID": "5534591016",
"schedule": "EVA Home Charging"
},
"electricDeliveryCharges": {
"total": "$139.29",
"2018VintagePowerChargeInferenceAdjustment": "1.00"
},
"contactInfo": {
"phoneNumber": "555-123-4567",
"email": "[email protected]"
}
}
}
As we can see from the above output, most of the details were extracted in one go. But what if we don’t want to miss a single piece of information and the page is dense with content? In that case, we can split the image into multiple patches, pass each patch to the model separately, and eventually combine the extracted data.
To handle complex images with diverse regions, we can split them into smaller patches and process each patch individually. Here, we follow a straightforward approach of splitting the image into four equal sections. This works well for large documents where different regions contain distinct information and the sections are roughly equal in size (as in research papers).
def split_image_into_patches(image):
    width, height = image.size
    patches = {
        "top_left": image.crop((0, 0, width // 2, height // 2)),
        "top_right": image.crop((width // 2, 0, width, height // 2)),
        "bottom_left": image.crop((0, height // 2, width // 2, height)),
        "bottom_right": image.crop((width // 2, height // 2, width, height))
    }
    return patches
First, we create the patches with split_image_into_patches. Each patch is then processed separately with a prompt to extract the relevant details, and we store each patch’s result in a dictionary.

image_patches = split_image_into_patches(image)

extracted_data = {}
for patch_name, patch_image in image_patches.items():
    inputs = processor.process(
        images=[patch_image],
        text="Extract all the information from the page in JSON, each and every data needs to be present."
    )
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
    generated_tokens = output[0, inputs['input_ids'].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    extracted_data[patch_name] = generated_text
The above approach of splitting the image into equal parts is similar to splitting a long text document into fixed-length chunks. However, if a chunk boundary falls in the middle of continuous text, we lose context, and the same applies to images. So, instead of splitting the image into equal sections, what if we split it into visually semantic chunks?
We will try a simple approach here: combine OCR with a vertical line-gap heuristic on the bounding boxes to group the image into patches, and then pass those patches to the Molmo model.
We can apply OCR to identify text regions in the image and return the text along with bounding boxes.
import pytesseract

def extract_text_regions(image):
    ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    text_regions = []
    for i, word in enumerate(ocr_data['text']):
        if word.strip():  # Ignore empty strings
            x, y, w, h = ocr_data['left'][i], ocr_data['top'][i], ocr_data['width'][i], ocr_data['height'][i]
            text_regions.append({
                "text": word,
                "bbox": (x, y, x + w, y + h)
            })
    return text_regions
We can group text regions into logical chunks (like paragraphs or tables) for more coherent extraction. The function below groups words into larger chunks, such as lines or paragraphs, based on their bounding-box positions (the vertical gap between consecutive bounding boxes). This is useful for extracting more contextually coherent information from documents.
def group_text_regions(text_regions, line_threshold=10):
    grouped_regions = []
    current_group = []
    last_bottom = -1
    for region in text_regions:
        _, top, _, bottom = region['bbox']
        if last_bottom != -1 and (top - last_bottom > line_threshold):
            grouped_regions.append(current_group)
            current_group = []
        current_group.append(region)
        last_bottom = bottom
    if current_group:
        grouped_regions.append(current_group)
    return grouped_regions
Now, we will apply this approach to a page to create the groups and pass each patch to the model for extraction. Once all the JSON data is extracted, we can pass it to an LLM to combine everything together.
# Apply OCR to identify text regions
text_regions = extract_text_regions(image)

# Group text regions into semantic chunks
semantic_chunks = group_text_regions(text_regions)

# Initialize a dictionary to store extracted data from each chunk
extracted_data = {}

# Loop through each semantic chunk, process, and store the output
for idx, chunk in enumerate(semantic_chunks):
    # Create a bounding box for the chunk
    x_min = min([r['bbox'][0] for r in chunk])
    y_min = min([r['bbox'][1] for r in chunk])
    x_max = max([r['bbox'][2] for r in chunk])
    y_max = max([r['bbox'][3] for r in chunk])

    # Crop the image to the bounding box of the chunk
    chunk_image = image.crop((x_min, y_min, x_max, y_max))

    # Prepare text prompt for Molmo
    chunk_text = " ".join([r['text'] for r in chunk])
    prompt_text = f"Extract information from this section: {chunk_text} in JSON format."

    # Process the chunk image and prompt with Molmo
    inputs = processor.process(
        images=[chunk_image],
        text=prompt_text
    )
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
    generated_tokens = output[0, inputs['input_ids'].size(1):]
    generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
    print(generated_text, "\n\n")

    # Store the extracted data for the current chunk
    extracted_data[f"chunk_{idx}"] = generated_text

# Combine all extracted data
combined_data = {"page_summary": extracted_data}
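One way to do the final merge is to hand all chunk outputs to a text LLM with a consolidation prompt. The sketch below only builds that prompt from the extracted_data dictionary; call_text_llm is a placeholder for whatever LLM client you prefer and is not part of Molmo.

# Hypothetical consolidation step: ask a text LLM to merge the per-chunk
# JSON fragments into a single JSON document.
merge_prompt = (
    "Merge the following JSON fragments extracted from different regions of "
    "one page into a single, deduplicated JSON object:\n\n"
    + "\n\n".join(f"{name}:\n{text}" for name, text in extracted_data.items())
)

# merged_json = call_text_llm(merge_prompt)  # placeholder client, not part of Molmo
print(merge_prompt[:500])  # inspect the prompt before sending it anywhere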
This was a fun experiment, but it is not yet an optimized approach. We can improve it further by using segmentation to create logical chunks. If we stick with OCR, the grouping needs to be stricter and more heuristic-based, considering both vertical and horizontal line gaps along with checks on how much text or data each group contains; one such sketch follows below.
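As a starting point for that stricter heuristic, here is one possible sketch that also splits on large horizontal gaps and merges groups containing very few words; the thresholds are arbitrary and would need tuning on real documents.

def group_text_regions_strict(text_regions, line_gap=10, column_gap=80, min_words=3):
    # Sketch only: split on large vertical OR horizontal gaps between consecutive
    # bounding boxes, then merge tiny groups into their predecessor.
    grouped, current, last_bbox = [], [], None
    for region in text_regions:
        left, top, right, bottom = region['bbox']
        if last_bbox is not None:
            vertical_gap = top - last_bbox[3]
            horizontal_gap = left - last_bbox[2]
            if vertical_gap > line_gap or horizontal_gap > column_gap:
                grouped.append(current)
                current = []
        current.append(region)
        last_bbox = (left, top, right, bottom)
    if current:
        grouped.append(current)

    # Merge very small groups so each chunk keeps enough context
    merged = []
    for group in grouped:
        if merged and len(group) < min_words:
            merged[-1].extend(group)
        else:
            merged.append(group)
    return merged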
In this deep dive into Molmo and PixMo, we explored the motivations behind developing open and robust vision-language models, the detailed architecture of Molmo, and the unique datasets powering its capabilities. We walked through key design decisions, including why Molmo opted for a simpler, single-stage training pipeline and chose CLIP as the vision encoder for its superior performance in handling multi-crop, high-resolution images. The hands-on section showcased Molmo’s flexibility in extracting complex structured data, providing you with practical examples and code to try out yourself.

By embracing transparency, high-quality data, and efficient training strategies, Molmo sets a new standard in open multimodal research, offering a versatile tool for tackling diverse vision-language tasks. We have come to the end of the blog. I hope it provides a comprehensive understanding of Molmo and inspires you to experiment with its capabilities.
Q1. Why does Molmo use CLIP as its vision encoder instead of SigLIP?
Ans. Molmo uses CLIP because it demonstrated superior performance in handling multi-crop and high-resolution images. CLIP’s robust attention mechanisms and ability to capture spatial relationships across image patches make it more effective for complex visual tasks. In contrast, SigLIP struggled with multi-crop settings and was better suited for simpler, single-crop scenarios.
Q2. What data does Molmo train on, and how does it differ from synthetic datasets?
Ans. Molmo leverages the PixMo dataset, which includes high-quality, human-annotated image-caption pairs and specialized datasets like PixMo-AskModelAnything and PixMo-Points. These datasets provide diverse, real-world data that enhance Molmo’s generalization capabilities. Unlike synthetic datasets, PixMo’s human annotations ensure a richer and more natural understanding of visual content.
Q3. Can Molmo be customized for specific tasks?
Ans. Yes, Molmo is designed to be highly flexible. You can customize prompts based on your specific task needs, such as extracting structured data in JSON format or answering specific queries about an image. The hands-on examples in the blog demonstrate how to adapt Molmo to various use cases, making it suitable for tasks ranging from document understanding to image captioning.