Enhancing Multimodal RAG with Deepseek Janus Pro

Nibedita Dutta Last Updated : 15 Feb, 2025

DeepSeek Janus Pro 1B, launched on January 27, 2025, is an advanced multimodal AI model built to process and generate images from textual prompts. With its ability to comprehend and create images based on text, this 1 billion parameter version (1B) delivers efficient performance for a wide range of applications, including text-to-image generation and image understanding. Additionally, it excels at producing detailed captions from photos, making it a versatile tool for both creative and analytical tasks.

Learning Objectives

  • Analyzing the architecture of DeepSeek Janus Pro and the key features that enhance its capabilities.
  • Exploring the underlying design and its impact on performance.
  • Following a step-by-step guide to building a Retrieval-Augmented Generation (RAG) system.
  • Utilizing the DeepSeek Janus Pro 1B model for real-world applications.
  • Understanding how DeepSeek Janus Pro optimizes AI-driven solutions.

This article was published as a part of the Data Science Blogathon.

What is DeepSeek Janus Pro?

DeepSeek Janus Pro is a multimodal AI model that integrates text and image processing, capable of understanding and generating images from text prompts. The 1 billion parameter version (1B) is designed for efficient performance across applications like text-to-image generation and image understanding tasks.

Under DeepSeek’s Janus Pro series, the primary models available are “Janus Pro 1B” and “Janus Pro 7B”, which differ mainly in parameter count. The 7B model is significantly larger and offers improved performance in text-to-image generation tasks. Both are multimodal models capable of handling visual understanding as well as text generation based on visual context.

Key Features and Design Aspects of Janus Pro 1B

  • Architecture: Janus Pro uses a unified transformer architecture but decouples visual encoding into separate pathways to improve performance in both image understanding and creation tasks.
  • Capabilities: It excels in tasks related to both understanding of images and the generation of new ones based on text prompts. It supports 384×384 image inputs.
  • Image Encoders: For image understanding tasks, Janus uses SigLIP to encode images. SigLIP is an image embedding model that follows CLIP’s framework but replaces the softmax-based contrastive loss with a pairwise sigmoid loss (a minimal sketch of this loss follows this list). For image generation, Janus uses an existing encoder from LlamaGen, an autoregressive image generation model. LlamaGen is a family of image-generation models that applies the next-token prediction paradigm of large language models to visual generation.
  • Open Source: It is available on GitHub under the MIT License, with model usage governed by the DeepSeek Model License.
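To make the difference concrete, below is a minimal PyTorch sketch of a SigLIP-style pairwise sigmoid loss. This is an illustration of the idea only, not the actual SigLIP implementation; the function name and the toy embeddings are our own.

import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # L2-normalize the image and text embeddings
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b          # (N, N) pairwise similarities
    labels = 2 * torch.eye(logits.size(0)) - 1    # +1 for matched pairs, -1 otherwise
    # Each image-text pair is scored independently with a sigmoid,
    # instead of CLIP's batch-wide softmax over all candidates.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# Toy usage with random embeddings
print(siglip_style_loss(torch.randn(4, 512), torch.randn(4, 512)))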

Also read: How to Access DeepSeek Janus Pro 7B?

Decoupled Architecture For Image Understanding & Generation

Architectural features of DeepSeek Janus Pro

Janus-Pro diverges from previous multimodal models by employing separate, specialized pathways for visual encoding, rather than relying on a single visual encoder for both image understanding and generation.

  • Image Understanding Encoder: This pathway extracts semantic features from images.
  • Image Generation Encoder: This pathway synthesizes images based on text descriptions.

This decoupled architecture facilitates task-specific optimization, mitigating conflicts between interpretation and creative synthesis. The independent encoders extract input features, which are then processed by a unified autoregressive transformer. This allows the multimodal understanding and generation components to independently select their most suitable encoding methods.

Also read: How DeepSeek’s Janus Pro Stacks Up Against DALL-E 3?

Key Features of Model Architecture

1. Dual-pathway architecture for visual understanding & generation

  • Visual Understanding Pathway: For multimodal understanding tasks, Janus Pro uses SigLIP-L as the visual encoder, which supports image inputs of up to 384×384 resolution. This high-resolution support allows the model to capture more image details, thereby improving the accuracy of visual understanding.  
  • Visual Generation Pathway: For image generation tasks, Janus Pro uses LlamaGen Tokenizer with a downsampling rate of 16 to generate more detailed images.  
Fig 1. The architecture of Janus-Pro. Visual encoding is decoupled for multimodal understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” and “Generation Encoder”, respectively. Source: DeepSeek Janus-Pro

2. Unified Transformer Architecture

A shared transformer backbone is used to fuse text and image features. The features produced by the independent encoders are processed by a unified autoregressive transformer.
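The sketch below illustrates this decoupled-encoders-plus-shared-backbone design. All class and method names here are hypothetical and exist only to show the flow of data; this is not the actual Janus-Pro implementation.

import torch
import torch.nn as nn

class JanusStyleModel(nn.Module):
    # Hypothetical module layout, purely to illustrate the decoupled design
    def __init__(self, und_encoder, gen_encoder, adaptor, backbone):
        super().__init__()
        self.und_encoder = und_encoder   # e.g. a SigLIP-L vision tower (understanding path)
        self.gen_encoder = gen_encoder   # e.g. a LlamaGen-style tokenizer (generation path)
        self.adaptor = adaptor           # projects image features into the language-model space
        self.backbone = backbone         # shared autoregressive transformer

    def understand(self, image, text_embeds):
        # Understanding path: semantic image features are fused with text embeddings
        img_feats = self.adaptor(self.und_encoder(image))
        seq = torch.cat([img_feats, text_embeds], dim=1)
        return self.backbone(inputs_embeds=seq)

    def generate_image(self, text_embeds):
        # Generation path: the backbone autoregressively predicts discrete image tokens,
        # which a decoder paired with gen_encoder turns back into pixels
        return self.backbone(inputs_embeds=text_embeds)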

3. Optimized Training Strategy

The previous Janus model used a three-stage training process. Stage I focused on training the adaptors and the image head. Stage II handled unified pretraining, during which all components except the understanding encoder and the generation encoder had their parameters updated. Stage III covered supervised fine-tuning, building on Stage II by further unlocking the parameters of the understanding encoder during training (a rough sketch of this staged unfreezing appears after the list below).

This was improved in Janus Pro:

  • By increasing the training steps in Stage I, allowing sufficient training on the ImageNet dataset.
  • Additionally, in Stage II, the ImageNet data was dropped completely from text-to-image generation training. Instead, normal text-to-image data was used to train the model to generate images based on dense descriptions. This was found to improve both training efficiency and overall performance.
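As a rough illustration of the staged unfreezing described above, the sketch below toggles requires_grad per stage. The module names (adaptor, image_head, und_encoder, gen_encoder) are assumptions used only for illustration; this is not the actual Janus-Pro training code.

def set_trainable(model, stage):
    # Freeze everything, then unlock only what the given stage trains
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:
        # Stage I: train only the adaptors and the image head
        for module in (model.adaptor, model.image_head):
            for p in module.parameters():
                p.requires_grad = True
    elif stage == 2:
        # Stage II: unified pretraining - update everything except the two encoders
        for name, p in model.named_parameters():
            if not name.startswith(("und_encoder", "gen_encoder")):
                p.requires_grad = True
    elif stage == 3:
        # Stage III: supervised fine-tuning - additionally unlock the understanding encoder
        for name, p in model.named_parameters():
            if not name.startswith("gen_encoder"):
                p.requires_grad = True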

Now, let’s build a multimodal RAG system with DeepSeek Janus Pro.

Multimodal RAG with Deepseek Janus Pro 1B model

In the following steps, we will build a multimodal RAG system that queries images using the DeepSeek Janus Pro 1B model.

Step 1. Install Necessary Libraries

!pip install byaldi ollama pdf2image
!sudo apt-get install -y poppler-utils
!git clone https://github.com/deepseek-ai/Janus.git
!pip install -e ./Janus

Step 2. Model For Saving Image Embeddings

import os
from pathlib import Path
from byaldi import RAGMultiModalModel
import ollama
# Initialize RAGMultiModalModel
model1 = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")

Byaldi provides an easy-to-use framework for setting up multimodal RAG systems. As seen in the code above, we load ColQwen2, a model designed for efficient document indexing using visual features.

Step 3. Loading the Image PDF

# Use ColQwen2 to index and store the presentation
index_name = "image_index"
model1.index(input_path=Path("/content/PublicWaterMassMailing.pdf"),
    index_name=index_name,
    store_collection_with_index=True, # Stores base64 images along with the vectors
    overwrite=True
)

We will query this PDF in the next steps to build the RAG system. In the code above, the pages of the PDF are stored as base64 images along with their vectors.

Step 4. Querying & Retrieval From Saved Images

query = "How many clients drive more than 50% revenue?"
returned_page = model1.search(query, k=1)[0]
import base64
# Base64-encoded image of the retrieved page
base64_string = returned_page['base64']

# Decode the Base64 string
image_data = base64.b64decode(base64_string)
with open('output_image.png', 'wb') as image_file:
    image_file.write(image_data)

Based on the query, the most relevant page of the PDF is retrieved and saved as output_image.png.
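The search result also carries metadata worth inspecting while debugging retrieval. The snippet below assumes the result exposes page_num and score fields in addition to base64 (as byaldi results typically do); adjust if your byaldi version differs.

# Inspect which page was retrieved and how confident the retriever was
print("Retrieved page number:", returned_page["page_num"])
print("Relevance score:", returned_page["score"])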

Step 5. Load Janus Pro Model

import os
os.chdir(r"/content/Janus")

from janus.models import VLChatProcessor
from transformers import AutoConfig, AutoModelForCausalLM
import torch
from janus.utils.io import load_pil_images
from PIL import Image

processor = VLChatProcessor.from_pretrained("deepseek-ai/Janus-Pro-1B")
tokenizer = processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/Janus-Pro-1B", trust_remote_code=True
)

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{query}",
        "images": ['/content/output_image.png'],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images)

# run the image encoder to get the input embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)

  • VLChatProcessor.from_pretrained(“deepseek-ai/Janus-Pro-1B”) loads a pretrained processor for handling multimodal inputs (images and text). This processor prepares the input data for the model.
  • The tokenizer is extracted from the VLChatProcessor. It tokenizes the text input, converting it into a format suitable for the model.
  • AutoModelForCausalLM.from_pretrained(“deepseek-ai/Janus-Pro-1B”) loads the pretrained Janus Pro model for causal language modelling.
  • A multimodal conversation format is set up in which the user supplies both text and an image.
  • load_pil_images(conversation) loads the images listed in the conversation object and converts them into PIL Image format, which is commonly used for image processing in Python.
  • The processor (the VLChatProcessor from the DeepSeek Janus Pro model) takes both the text and the image as input and produces the model-ready inputs.
  • prepare_inputs_embeds(**inputs) takes the processed inputs (text and image) and prepares the embeddings the model needs to generate a response.
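Optionally, if a GPU is available, the model can be cast to bfloat16 and the processed inputs moved to the same device before computing the embeddings. The snippet below is a sketch mirroring the usage examples in the Janus repository; on CPU you can skip it.

# Optional GPU setup (sketch): cast the model to bfloat16 and co-locate the inputs
if torch.cuda.is_available():
    vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
    inputs = inputs.to(vl_gpt.device)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)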

Step 6. Output Generation

outputs =  vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)

The code generates a response from the DeepSeek Janus Pro 1B model using the prepared input embeddings (text and image). Several configuration settings are passed: the padding, start, and end token IDs, the maximum number of new tokens, and whether to use caching and sampling. After generation, the token IDs are decoded back into human-readable text with the tokenizer, and the decoded output is stored in the answer variable.

The whole code is present in this colab notebook.  
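To reuse the pipeline for multiple questions, the retrieval and generation steps above can be wrapped into a single helper. The function below simply replays the steps we have already written; the function name and the temporary image path are our own choices.

def answer_from_pdf(query: str, k: int = 1) -> str:
    # Retrieve the most relevant page and save it as an image
    page = model1.search(query, k=k)[0]
    image_path = "/content/retrieved_page.png"
    with open(image_path, "wb") as f:
        f.write(base64.b64decode(page["base64"]))

    # Build the multimodal conversation and prepare the inputs
    conversation = [
        {"role": "<|User|>", "content": f"<image_placeholder>\n{query}", "images": [image_path]},
        {"role": "<|Assistant|>", "content": ""},
    ]
    pil_images = load_pil_images(conversation)
    inputs = processor(conversations=conversation, images=pil_images)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)

    # Generate and decode the answer
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

print(answer_from_pdf(query))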

Output For the Query

[Model output image]

Output For Another Query

“What has been the revenue in France?”

[Model output image]

The above response is not accurate. Even though the relevant page was retrieved by the ColQwen2 retriever, the DeepSeek Janus Pro 1B model could not generate the correct answer from the page; the exact answer should be $2B.

Output For Another Query

“What has been the number of promotions since beginning of FY20?”

[Model output image]

The above response is correct, as it matches the text in the PDF.

Conclusions

In conclusion, the DeepSeek Janus Pro 1B model represents a significant advancement in multimodal AI, with its decoupled architecture that optimizes both image understanding and generation tasks. By utilizing separate visual encoders for these tasks and refining its training strategy, Janus Pro offers enhanced performance in text-to-image generation and image analysis. This innovative approach (Multimodal RAG with Deepseek Janus Pro), combined with its open-source accessibility, makes it a powerful tool for various applications in AI-driven visual comprehension and creation.

Key Takeaways

  1. Multimodal AI with Dual Pathways: Janus Pro 1B integrates both text and image processing, using separate encoders for image understanding (SigLIP) and image generation (LlamaGen), enhancing task-specific performance.
  2. Decoupled Architecture: The model separates visual encoding into distinct pathways, enabling independent optimization for image understanding and generation, thus minimizing conflicts in processing tasks.
  3. Unified Transformer Backbone: A shared transformer architecture merges the features of text and images, streamlining multimodal data fusion for more effective AI performance.
  4. Improved Training Strategy: Janus Pro’s optimized training approach includes increased steps in Stage I and the use of specialized text-to-image data in Stage II, significantly boosting training efficiency and output quality.
  5. Open-Source Accessibility: Janus Pro 1B is available on GitHub under the MIT License, encouraging widespread use and adaptation in various AI-driven applications.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is DeepSeek Janus Pro 1B?

Ans. DeepSeek Janus Pro 1B is a multimodal AI model designed to integrate both text and image processing, capable of understanding and generating images from text descriptions. It features 1 billion parameters for efficient performance in tasks like text-to-image generation and image understanding.

Q2. How does the architecture of Janus Pro 1B work?

Ans. Janus Pro uses a unified transformer architecture with decoupled visual encoding. This means it employs separate pathways for image understanding and generation, allowing task-specific optimization for each task.

Q3. How does the training process of Janus Pro differ from previous versions?

Ans. Janus Pro improves on previous training strategies by increasing training steps, dropping the ImageNet dataset in favor of specialized text-to-image data, and focusing on better fine-tuning for enhanced efficiency and performance.

Q4. What kind of applications can benefit from using Janus Pro 1B?

Ans. Janus Pro 1B is particularly useful for tasks involving text-to-image generation, image understanding, and multimodal AI applications that require both image and text processing capabilities.

Q5. How does Janus-Pro compare to other models like DALL-E 3?

Ans. Janus-Pro-7B outperforms DALL-E 3 in benchmarks such as GenEval and DPG-Bench, according to DeepSeek. Janus-Pro separates understanding/generation, scales data/models for stable image generation, and maintains a unified, flexible, and cost-efficient structure. While both models perform text-to-image generation, Janus-Pro also offers image captioning, which DALL-E 3 does not.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
