Enhancing Multimodal RAG with Deepseek Janus Pro

Nibedita Dutta Last Updated : 15 Feb, 2025

DeepSeek Janus Pro 1B, launched on January 27, 2025, is an advanced multimodal AI model built to process and generate images from textual prompts. With its ability to comprehend and create images based on text, this 1 billion parameter version (1B) delivers efficient performance for a wide range of applications, including text-to-image generation and image understanding. Additionally, it excels at producing detailed captions from photos, making it a versatile tool for both creative and analytical tasks.

Learning Objectives

  • Analyzing the architecture of DeepSeek Janus Pro and the key features that enhance its capabilities.
  • Exploring the underlying design and its impact on performance.
  • Following a step-by-step guide to building a Retrieval-Augmented Generation (RAG) system.
  • Utilizing the DeepSeek Janus Pro 1B model for real-world applications.
  • Understanding how DeepSeek Janus Pro optimizes AI-driven solutions.

This article was published as a part of the Data Science Blogathon.

What is DeepSeek Janus Pro?

DeepSeek Janus Pro is a multimodal AI model that integrates text and image processing, capable of understanding and generating images from text prompts. The 1 billion parameter version (1B) is designed for efficient performance across applications like text-to-image generation and image understanding tasks.

Under DeepSeek’s Janus Pro series, the primary models available are “Janus Pro 1B” and “Janus Pro 7B”, which differ mainly in parameter count. The 7B model is significantly larger and offers improved performance in text-to-image generation tasks. Both are multimodal models capable of handling visual understanding as well as text generation based on visual context.

Key Features and Design Aspects of Janus Pro 1B

  • Architecture: Janus Pro uses a unified transformer architecture but decouples visual encoding into separate pathways to improve performance in both image understanding and creation tasks.
  • Capabilities: It excels in tasks related to both understanding of images and the generation of new ones based on text prompts. It supports 384×384 image inputs.
  • Image Encoders: For image understanding tasks, Janus uses SigLIP to encode images. SigLIP is an image embedding model that follows CLIP’s framework but replaces the softmax-based contrastive loss with a pairwise sigmoid loss (a minimal sketch of this loss follows this list). For image generation, Janus uses an existing encoder from LlamaGen, an autoregressive image generation model. LlamaGen is a family of image-generation models that applies the next-token prediction paradigm of large language models to visual generation.
  • Open Source: It is available on GitHub under the MIT License, with model usage governed by the DeepSeek Model License.
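To make the difference concrete, below is a minimal PyTorch sketch of a SigLIP-style pairwise sigmoid loss. This is an illustration of the idea only, not the actual SigLIP implementation; the function name and the toy embeddings are our own.

import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # L2-normalize the image and text embeddings
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b          # (N, N) pairwise similarities
    labels = 2 * torch.eye(logits.size(0)) - 1    # +1 for matched pairs, -1 otherwise
    # Each image-text pair is scored independently with a sigmoid,
    # instead of CLIP's batch-wide softmax over all candidates.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# Toy usage with random embeddings
print(siglip_style_loss(torch.randn(4, 512), torch.randn(4, 512)))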

Also read: How to Access DeepSeek Janus Pro 7B?

Decoupled Architecture For Image Understanding & Generation

Architectural features of DeepSeek Janus Pro

Janus-Pro diverges from previous multimodal models by employing separate, specialized pathways for visual encoding, rather than relying on a single visual encoder for both image understanding and generation.

  • Image Understanding Encoder: This pathway extracts semantic features from images.
  • Image Generation Encoder: This pathway synthesizes images based on text descriptions.

This decoupled architecture facilitates task-specific optimization, mitigating conflicts between interpretation and creative synthesis. The independent encoders extract input features, which are then processed by a unified autoregressive transformer. This allows the multimodal understanding and generation components to independently select their most suitable encoding methods.

Also read: How DeepSeek’s Janus Pro Stacks Up Against DALL-E 3?

Key Features of Model Architecture

1. Dual-pathway architecture for visual understanding & generation

  • Visual Understanding Pathway: For multimodal understanding tasks, Janus Pro uses SigLIP-L as the visual encoder, which supports image inputs of up to 384×384 resolution. This high-resolution support allows the model to capture more image details, thereby improving the accuracy of visual understanding.  
  • Visual Generation Pathway: For image generation tasks, Janus Pro uses LlamaGen Tokenizer with a downsampling rate of 16 to generate more detailed images.  
Fig 1. The architecture of Janus-Pro. Visual encoding is decoupled for multimodal understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” and “Generation Encoder”, respectively. Source: DeepSeek Janus-Pro

2. Unified Transformer Architecture

A shared transformer backbone is used to fuse text and image features. The features produced by the independent encoders are processed by a unified autoregressive transformer.
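The sketch below illustrates this decoupled-encoders-plus-shared-backbone design. All class and method names here are hypothetical and exist only to show the flow of data; this is not the actual Janus-Pro implementation.

import torch
import torch.nn as nn

class JanusStyleModel(nn.Module):
    # Hypothetical module layout, purely to illustrate the decoupled design
    def __init__(self, und_encoder, gen_encoder, adaptor, backbone):
        super().__init__()
        self.und_encoder = und_encoder   # e.g. a SigLIP-L vision tower (understanding path)
        self.gen_encoder = gen_encoder   # e.g. a LlamaGen-style tokenizer (generation path)
        self.adaptor = adaptor           # projects image features into the language-model space
        self.backbone = backbone         # shared autoregressive transformer

    def understand(self, image, text_embeds):
        # Understanding path: semantic image features are fused with text embeddings
        img_feats = self.adaptor(self.und_encoder(image))
        seq = torch.cat([img_feats, text_embeds], dim=1)
        return self.backbone(inputs_embeds=seq)

    def generate_image(self, text_embeds):
        # Generation path: the backbone autoregressively predicts discrete image tokens,
        # which a decoder paired with gen_encoder turns back into pixels
        return self.backbone(inputs_embeds=text_embeds)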

3. Optimized Training Strategy

The previous Janus model used a three-stage training process. Stage I focused on training the adaptors and the image head. Stage II handled unified pretraining, during which all components except the understanding encoder and the generation encoder had their parameters updated. Stage III covered supervised fine-tuning, building on Stage II by further unlocking the parameters of the understanding encoder during training (a rough sketch of this staged unfreezing appears after the list below).

This was improved in Janus Pro:

  • By increasing the training steps in Stage I, allowing sufficient training on the ImageNet dataset.
  • Additionally, in Stage II, the ImageNet data was dropped completely from text-to-image generation training. Instead, normal text-to-image data was used to train the model to generate images based on dense descriptions. This was found to improve both training efficiency and overall performance.
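As a rough illustration of the staged unfreezing described above, the sketch below toggles requires_grad per stage. The module names (adaptor, image_head, und_encoder, gen_encoder) are assumptions used only for illustration; this is not the actual Janus-Pro training code.

def set_trainable(model, stage):
    # Freeze everything, then unlock only what the given stage trains
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:
        # Stage I: train only the adaptors and the image head
        for module in (model.adaptor, model.image_head):
            for p in module.parameters():
                p.requires_grad = True
    elif stage == 2:
        # Stage II: unified pretraining - update everything except the two encoders
        for name, p in model.named_parameters():
            if not name.startswith(("und_encoder", "gen_encoder")):
                p.requires_grad = True
    elif stage == 3:
        # Stage III: supervised fine-tuning - additionally unlock the understanding encoder
        for name, p in model.named_parameters():
            if not name.startswith("gen_encoder"):
                p.requires_grad = True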

Now, let’s build a multimodal RAG system with DeepSeek Janus Pro.

Multimodal RAG with Deepseek Janus Pro 1B model

In the following steps, we will build a multimodal RAG system that queries images using the DeepSeek Janus Pro 1B model.

Step 1. Install Necessary Libraries

!pip install byaldi ollama pdf2image
!sudo apt-get install -y poppler-utils
!git clone https://github.com/deepseek-ai/Janus.git
!pip install -e ./Janus

Step 2. Model For Saving Image Embeddings

import os
from pathlib import Path
from byaldi import RAGMultiModalModel
import ollama
# Initialize RAGMultiModalModel
model1 = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")

Byaldi provides an easy-to-use framework for setting up multimodal RAG systems. As seen in the code above, we load ColQwen2, a model designed for efficient document indexing using visual features.

Step 3. Loading the Image PDF

# Use ColQwen2 to index and store the presentation
index_name = "image_index"
model1.index(input_path=Path("/content/PublicWaterMassMailing.pdf"),
    index_name=index_name,
    store_collection_with_index=True, # Stores base64 images along with the vectors
    overwrite=True
)

We will query this PDF in the next steps to build the RAG system. In the code above, the pages of the PDF are stored as base64 images along with their vectors.

Step 4. Querying & Retrieval From Saved Images

query = "How many clients drive more than 50% revenue?"
returned_page = model1.search(query, k=1)[0]
import base64
# Base64-encoded image of the retrieved page
base64_string = returned_page['base64']

# Decode the Base64 string
image_data = base64.b64decode(base64_string)
with open('output_image.png', 'wb') as image_file:
    image_file.write(image_data)

Based on the query, the most relevant page of the PDF is retrieved and saved as output_image.png.
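The search result also carries metadata worth inspecting while debugging retrieval. The snippet below assumes the result exposes page_num and score fields in addition to base64 (as byaldi results typically do); adjust if your byaldi version differs.

# Inspect which page was retrieved and how confident the retriever was
print("Retrieved page number:", returned_page["page_num"])
print("Relevance score:", returned_page["score"])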

Step 5. Load Janus Pro Model

import os
os.chdir(r"/content/Janus")

from janus.models import VLChatProcessor
from transformers import AutoConfig, AutoModelForCausalLM
import torch
from janus.utils.io import load_pil_images
from PIL import Image

processor = VLChatProcessor.from_pretrained("deepseek-ai/Janus-Pro-1B")
tokenizer = processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/Janus-Pro-1B", trust_remote_code=True
)

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{query}",
        "images": ['/content/output_image.png'],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images)

# run the image encoder to get the input embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)

  • VLChatProcessor.from_pretrained(“deepseek-ai/Janus-Pro-1B”) loads a pretrained processor for handling multimodal inputs (images and text). This processor prepares the input data for the model.
  • The tokenizer is extracted from the VLChatProcessor. It tokenizes the text input, converting it into a format suitable for the model.
  • AutoModelForCausalLM.from_pretrained(“deepseek-ai/Janus-Pro-1B”) loads the pretrained Janus Pro model for causal language modelling.
  • A multimodal conversation format is set up in which the user supplies both text and an image.
  • load_pil_images(conversation) loads the images listed in the conversation object and converts them into PIL Image format, which is commonly used for image processing in Python.
  • The processor (the VLChatProcessor from the DeepSeek Janus Pro model) takes both the text and the image as input and produces the model-ready inputs.
  • prepare_inputs_embeds(**inputs) takes the processed inputs (text and image) and prepares the embeddings the model needs to generate a response.
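Optionally, if a GPU is available, the model can be cast to bfloat16 and the processed inputs moved to the same device before computing the embeddings. The snippet below is a sketch mirroring the usage examples in the Janus repository; on CPU you can skip it.

# Optional GPU setup (sketch): cast the model to bfloat16 and co-locate the inputs
if torch.cuda.is_available():
    vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
    inputs = inputs.to(vl_gpt.device)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)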

Step 6. Output Generation

outputs =  vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)

The code generates a response from the DeepSeek Janus Pro 1B model using the prepared input embeddings (text and image). Several configuration settings are passed: the padding, start, and end token IDs, the maximum number of new tokens, and whether to use caching and sampling. After generation, the token IDs are decoded back into human-readable text with the tokenizer, and the decoded output is stored in the answer variable.

The whole code is present in this colab notebook.  
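To reuse the pipeline for multiple questions, the retrieval and generation steps above can be wrapped into a single helper. The function below simply replays the steps we have already written; the function name and the temporary image path are our own choices.

def answer_from_pdf(query: str, k: int = 1) -> str:
    # Retrieve the most relevant page and save it as an image
    page = model1.search(query, k=k)[0]
    image_path = "/content/retrieved_page.png"
    with open(image_path, "wb") as f:
        f.write(base64.b64decode(page["base64"]))

    # Build the multimodal conversation and prepare the inputs
    conversation = [
        {"role": "<|User|>", "content": f"<image_placeholder>\n{query}", "images": [image_path]},
        {"role": "<|Assistant|>", "content": ""},
    ]
    pil_images = load_pil_images(conversation)
    inputs = processor(conversations=conversation, images=pil_images)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)

    # Generate and decode the answer
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

print(answer_from_pdf(query))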

Output For the Query

[Model output image]

Output For Another Query

“What has been the revenue in France?”

[Model output image]

The above response is not accurate. Even though the relevant page was retrieved by the ColQwen2 retriever, the DeepSeek Janus Pro 1B model could not generate the correct answer from the page; the exact answer should be $2B.

Output For Another Query

“What has been the number of promotions since beginning of FY20?”

[Model output image]

The above response is correct, as it matches the text in the PDF.

Conclusions

In conclusion, the DeepSeek Janus Pro 1B model represents a significant advancement in multimodal AI, with its decoupled architecture that optimizes both image understanding and generation tasks. By utilizing separate visual encoders for these tasks and refining its training strategy, Janus Pro offers enhanced performance in text-to-image generation and image analysis. This innovative approach (Multimodal RAG with Deepseek Janus Pro), combined with its open-source accessibility, makes it a powerful tool for various applications in AI-driven visual comprehension and creation.

Key Takeaways

  1. Multimodal AI with Dual Pathways: Janus Pro 1B integrates both text and image processing, using separate encoders for image understanding (SigLIP) and image generation (LlamaGen), enhancing task-specific performance.
  2. Decoupled Architecture: The model separates visual encoding into distinct pathways, enabling independent optimization for image understanding and generation, thus minimizing conflicts in processing tasks.
  3. Unified Transformer Backbone: A shared transformer architecture merges the features of text and images, streamlining multimodal data fusion for more effective AI performance.
  4. Improved Training Strategy: Janus Pro’s optimized training approach includes increased steps in Stage I and the use of specialized text-to-image data in Stage II, significantly boosting training efficiency and output quality.
  5. Open-Source Accessibility: Janus Pro 1B is available on GitHub under the MIT License, encouraging widespread use and adaptation in various AI-driven applications.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is DeepSeek Janus Pro 1B?

Ans. DeepSeek Janus Pro 1B is a multimodal AI model designed to integrate both text and image processing, capable of understanding and generating images from text descriptions. It features 1 billion parameters for efficient performance in tasks like text-to-image generation and image understanding.

Q2. How does the architecture of Janus Pro 1B work?

Ans. Janus Pro uses a unified transformer architecture with decoupled visual encoding. This means it employs separate pathways for image understanding and generation, allowing task-specific optimization for each task.

Q3. How does the training process of Janus Pro differ from previous versions?

Ans. Janus Pro improves on previous training strategies by increasing training steps, dropping the ImageNet dataset in favor of specialized text-to-image data, and focusing on better fine-tuning for enhanced efficiency and performance.

Q4. What kind of applications can benefit from using Janus Pro 1B?

Ans. Janus Pro 1B is particularly useful for tasks involving text-to-image generation, image understanding, and multimodal AI applications that require both image and text processing capabilities.

Q5. How does Janus-Pro compare to other models like DALL-E 3?

Ans. Janus-Pro-7B outperforms DALL-E 3 in benchmarks such as GenEval and DPG-Bench, according to DeepSeek. Janus-Pro separates understanding/generation, scales data/models for stable image generation, and maintains a unified, flexible, and cost-efficient structure. While both models perform text-to-image generation, Janus-Pro also offers image captioning, which DALL-E 3 does not.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
