DeepSeek Janus Pro 1B, launched on January 27, 2025, is an advanced multimodal AI model built to process and generate images from textual prompts. With its ability to comprehend and create images based on text, this 1 billion parameter version (1B) delivers efficient performance for a wide range of applications, including text-to-image generation and image understanding. Additionally, it excels at producing detailed captions from photos, making it a versatile tool for both creative and analytical tasks.
This article was published as a part of the Data Science Blogathon.
DeepSeek Janus Pro is a multimodal AI model that integrates text and image processing, capable of understanding and generating images from text prompts. The 1 billion parameter version (1B) is designed for efficient performance across applications like text-to-image generation and image understanding tasks.
DeepSeek's Janus Pro series currently includes two models, "Janus Pro 1B" and "Janus Pro 7B". They differ mainly in parameter count: the 7B model is considerably larger and performs better on text-to-image generation tasks. Both are multimodal models that handle visual understanding as well as text generation based on visual context.
Janus-Pro diverges from previous multimodal models by employing separate, specialized pathways for visual encoding, rather than relying on a single visual encoder for both image understanding and generation.
This decoupled architecture facilitates task-specific optimizations, mitigating conflicts between interpretation and creative synthesis. The independent encoders interpret input features which are then processed by a unified autoregressive transformer. This allows both multimodal understanding and generation components to independently select their most suitable encoding methods.
A shared transformer backbone fuses text and image features. Each independent encoder converts its raw input into features, which the unified autoregressive transformer then processes.
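To make the decoupled design more concrete, here is a minimal, purely illustrative sketch in plain Python. It is not DeepSeek's implementation: the function names (`understanding_encoder`, `generation_encoder`, `shared_backbone`, `run`) and the toy string "features" are assumptions invented for this example. It only shows the routing idea, where each task selects its own visual encoder and both paths feed the same backbone.

```python
from typing import Callable, Dict, List

Features = List[str]  # toy stand-in for real feature vectors


def understanding_encoder(image: str) -> Features:
    # Hypothetical stand-in for the encoder used for image understanding
    return [f"und:{image}"]


def generation_encoder(image: str) -> Features:
    # Hypothetical stand-in for the encoder used for image generation
    return [f"gen:{image}"]


def text_encoder(text: str) -> Features:
    # Toy text path: one "feature" per whitespace-separated token
    return [f"txt:{token}" for token in text.split()]


# Each task picks its own visual pathway
ENCODERS: Dict[str, Callable[[str], Features]] = {
    "understanding": understanding_encoder,
    "generation": generation_encoder,
}


def shared_backbone(features: Features) -> str:
    # Toy stand-in for the unified autoregressive transformer
    return " | ".join(features)


def run(task: str, image: str, prompt: str) -> str:
    image_features = ENCODERS[task](image)
    return shared_backbone(image_features + text_encoder(prompt))


print(run("understanding", "page.png", "describe this"))
print(run("generation", "page.png", "describe this"))
```

Only the visual pathway changes between the two calls; the text path and the backbone are shared.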
In the original Janus, the model was trained in three stages. Stage I focused on training the adaptors and the image head. Stage II handled unified pretraining, during which all components except the understanding encoder and the generation encoder had their parameters updated. Stage III covered supervised fine-tuning, building upon Stage II by further unlocking the parameters of the understanding encoder.
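The three-stage schedule described above can be summarized as a small lookup table. This is only a sketch of the description in the text: the component labels are illustrative names rather than identifiers from the Janus codebase, and `language_model` is an assumed label for the remaining components that the text groups under "all components".

```python
# Component labels are illustrative, not names from the Janus codebase.
COMPONENTS = {
    "understanding_encoder",
    "generation_encoder",
    "adaptors",
    "image_head",
    "language_model",  # assumed label for the remaining components
}

VISUAL_ENCODERS = {"understanding_encoder", "generation_encoder"}

# Which components have their parameters updated in each stage
TRAINABLE_BY_STAGE = {
    1: {"adaptors", "image_head"},
    2: COMPONENTS - VISUAL_ENCODERS,
    3: (COMPONENTS - VISUAL_ENCODERS) | {"understanding_encoder"},
}


def is_trainable(stage: int, component: str) -> bool:
    return component in TRAINABLE_BY_STAGE[stage]


for stage in sorted(TRAINABLE_BY_STAGE):
    frozen = sorted(COMPONENTS - TRAINABLE_BY_STAGE[stage])
    print(f"Stage {stage}: frozen = {frozen}")
```

In this summary the generation encoder stays frozen in all three stages, while the understanding encoder is unlocked only in Stage III.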
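This was improved in Janus Pro, which increases the number of training steps, drops the ImageNet dataset in favor of specialized text-to-image data, and puts more focus on fine-tuning.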
Now, let's build multimodal RAG with DeepSeek Janus Pro:
In the following steps, we will build a multimodal RAG system that answers queries over images using the DeepSeek Janus Pro 1B model.
!pip install byaldi ollama pdf2image
!sudo apt-get install -y poppler-utils
!git clone https://github.com/deepseek-ai/Janus.git
!pip install -e ./Janus
import os
from pathlib import Path
from byaldi import RAGMultiModalModel
import ollama
# Initialize RAGMultiModalModel
model1 = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")
Byaldi provides an easy-to-use framework for setting up multimodal RAG systems. In the code above, we load ColQwen2, a model designed for efficient document indexing using visual features.
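ColQwen2 belongs to the ColPali family of retrievers, which compare a query with a page using ColBERT-style late interaction: every query token is matched against its most similar page patch, and those best matches are summed. The sketch below is a toy version of that scoring rule in plain Python. It is not Byaldi's or ColQwen2's actual code; the function names and the 2-D vectors are made up for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching page-patch similarity."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices sorted from most to least relevant."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy embeddings invented for this example
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # matches both query tokens well
page_b = [[0.1, 0.1], [0.3, 0.2]]  # matches neither well

print(rank_pages(query, [page_b, page_a]))  # [1, 0]
```

Because every query token is compared with every patch, a page can score well even when the relevant text occupies only a small region of it.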
# Use ColQwen2 to index and store the presentation
index_name = "image_index"
model1.index(
    input_path=Path("/content/PublicWaterMassMailing.pdf"),
    index_name=index_name,
    store_collection_with_index=True,  # Stores base64 images along with the vectors
    overwrite=True,
)
This PDF is the document we will query in the next steps. In the above code, we store the PDF's page images (as base64) along with the vectors.
query = "How many clients drive more than 50% revenue?"
returned_page = model1.search(query, k=1)[0]
import base64
# Base64-encoded image of the retrieved page
base64_string = returned_page['base64']
# Decode the Base64 string
image_data = base64.b64decode(base64_string)
with open('output_image.png', 'wb') as image_file:
    image_file.write(image_data)
Based on the query, the most relevant page of the PDF is retrieved and saved as output_image.png.
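Before handing the decoded bytes to the vision model, it can be worth checking that they really are an image in the format the file name implies. The helper below is a small sketch using only the standard library. `detect_image_format` and `decode_page_image` are hypothetical names, not part of Byaldi, and the demo payload is fabricated (a PNG signature followed by filler bytes, not a real image).

```python
import base64
from typing import Tuple

# Well-known file signatures (magic bytes)
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def detect_image_format(data: bytes) -> str:
    """Guess the image format from the leading bytes."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"


def decode_page_image(base64_string: str) -> Tuple[bytes, str]:
    """Decode a Base64 string and report the detected image format."""
    data = base64.b64decode(base64_string)
    return data, detect_image_format(data)


# Fabricated demo payload: PNG signature plus filler, not a real image
demo = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")
data, fmt = decode_page_image(demo)
print(fmt)  # png
```

In the pipeline above, one could call `decode_page_image(returned_page['base64'])` and choose the file extension from the detected format instead of assuming `.png`.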
import os
os.chdir(r"/content/Janus")
from janus.models import VLChatProcessor
from transformers import AutoConfig, AutoModelForCausalLM
import torch
from janus.utils.io import load_pil_images
from PIL import Image
processor = VLChatProcessor.from_pretrained("deepseek-ai/Janus-Pro-1B")
tokenizer = processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/Janus-Pro-1B", trust_remote_code=True
)
conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{query}",
        "images": ['/content/output_image.png'],
    },
    {"role": "<|Assistant|>", "content": ""},
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images)
# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
The code generates a response from the DeepSeek Janus Pro 1B model using the prepared input embeddings (text and image). It uses several configuration settings like padding, start/end tokens, max token length, and whether to use caching and sampling. After the response is generated, it decodes the token IDs back into human-readable text using the tokenizer. The decoded output is stored in the answer variable.
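Since every new question needs the same two-turn conversation structure, it can be convenient to wrap it in a small function. The sketch below mirrors the structure used in the code above; `build_conversation` is a hypothetical helper name and is not part of the Janus package.

```python
from typing import Dict, List


def build_conversation(query: str, image_path: str) -> List[Dict]:
    """Build the user turn (image placeholder + query) and an empty assistant turn."""
    return [
        {
            "role": "<|User|>",
            "content": f"<image_placeholder>\n{query}",
            "images": [image_path],
        },
        {"role": "<|Assistant|>", "content": ""},
    ]


conversation = build_conversation(
    "What has been the revenue in France?", "/content/output_image.png"
)
print(conversation[0]["content"])
```

The returned list can be passed to `load_pil_images` and the processor exactly as in the steps above, so only the query and the retrieved image path change between questions.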
The whole code is available in this Colab notebook.
“What has been the revenue in France?”
The above response is not accurate. Although the ColQwen2 retriever returned the relevant page, the DeepSeek Janus Pro 1B model could not generate the correct answer from it. The exact answer should be $2B.
“What has been the number of promotions since beginning of FY20?”
The above response is correct as it matches with the text mentioned in the PDF.
In conclusion, the DeepSeek Janus Pro 1B model represents a significant advancement in multimodal AI, with a decoupled architecture that optimizes both image understanding and generation tasks. By utilizing separate visual encoders for these tasks and refining its training strategy, Janus Pro offers enhanced performance in text-to-image generation and image analysis. This approach (multimodal RAG with DeepSeek Janus Pro), combined with its open-source accessibility, makes it a powerful tool for various applications in AI-driven visual comprehension and creation.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Ans. DeepSeek Janus Pro 1B is a multimodal AI model designed to integrate both text and image processing, capable of understanding and generating images from text descriptions. It features 1 billion parameters for efficient performance in tasks like text-to-image generation and image understanding.
Ans. Janus Pro uses a unified transformer architecture with decoupled visual encoding. This means it employs separate pathways for image understanding and generation, allowing task-specific optimization for each task.
Ans. Janus Pro improves on previous training strategies by increasing training steps, dropping the ImageNet dataset in favor of specialized text-to-image data, and focusing on better fine-tuning for enhanced efficiency and performance.
Ans. Janus Pro 1B is particularly useful for tasks involving text-to-image generation, image understanding, and multimodal AI applications that require both image and text processing capabilities.
Ans. Janus-Pro-7B outperforms DALL-E 3 in benchmarks such as GenEval and DPG-Bench, according to DeepSeek. Janus-Pro separates understanding/generation, scales data/models for stable image generation, and maintains a unified, flexible, and cost-efficient structure. While both models perform text-to-image generation, Janus-Pro also offers image captioning, which DALL-E 3 does not.