Digital documents have long presented a dual challenge for both human readers and automated systems: preserving rich structural nuances while converting content into machine-processable formats. Traditional methods, whether relying on complex ensemble pipelines or massive foundational models, often struggle to balance accuracy with computational efficiency. SmolDocling emerges as a game-changing solution, offering an ultra-compact 256M-parameter vision-language model that performs end-to-end document conversion with remarkable precision and speed.
For decades, converting complex layouts, from business documents to academic papers, into structured representations has been a difficult task. Common issues include preserving reading order, recognizing tables, formulas, and code, and retaining each element's position on the page.
These challenges have spurred a great deal of research, yet solutions that are both efficient and accurate remain hard to come by.
SmolDocling addresses these hurdles head-on with a unified, end-to-end approach.
At its core, the model introduces a novel markup format known as DocTags: a universal standard that captures every element's content, structure, and spatial context.
DocTags changes how document elements are represented: each element is wrapped in an explicit tag that carries its type, its content, and its location on the page.
This clear, structured format minimizes ambiguity, a common issue with direct conversion to formats like HTML or Markdown, as the sketch below illustrates.
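To make this concrete, here is an illustrative sketch of what DocTags markup can look like. The tag names and the quantized `<loc_*>` location tokens mirror the examples published with the preview model, but treat the exact vocabulary as an assumption rather than a specification:

```xml
<doctag>
  <text><loc_58><loc_44><loc_426><loc_91>First paragraph of the page ...</text>
  <otsl><loc_58><loc_100><loc_426><loc_160><fcel>Item<fcel>Price<nl><fcel>Coffee<fcel>2.50<nl></otsl>
</doctag>
```

Each element carries its type (the tag), its bounding box (four location tokens), and its content; tables use compact OTSL cell tokens such as `<fcel>` (filled cell) and `<nl>` (end of row), which keeps structure and spatial context together in a single token stream.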
A key pillar of SmolDocling's success is its rich, diverse training data, which spans a wide range of document types and layouts.
SmolDocling builds upon the SmolVLM framework and incorporates several innovative techniques, such as aggressive visual-token compression, to ensure both efficiency and effectiveness.
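As a quick sanity check on the "ultra-compact" claim, you can count the released checkpoint's parameters directly. A minimal sketch, assuming `transformers` and `torch` are installed and using the same model id as the code later in this article:

```python
from transformers import AutoModelForVision2Seq

# Load the checkpoint and sum the parameter counts of all weight tensors.
model = AutoModelForVision2Seq.from_pretrained("ds4sd/SmolDocling-256M-preview")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expected to land in the ~256M range
```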
A thorough evaluation of SmolDocling against leading vision-language models highlights its competitive edge:
| Method | Model Size | Edit Distance ↓ | F1-score ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | METEOR ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5 VL [9] | 7B | 0.56 | 0.72 | 0.80 | 0.70 | 0.46 | 0.57 |
| GOT [89] | 580M | 0.61 | 0.69 | 0.71 | 0.73 | 0.48 | 0.59 |
| Nougat (base) [12] | 350M | 0.62 | 0.66 | 0.72 | 0.67 | 0.44 | 0.54 |
| SmolDocling (Ours) | 256M | 0.48 | 0.80 | 0.89 | 0.79 | 0.58 | 0.67 |
Insights: SmolDocling outperforms larger models across all key metrics in full-page transcription. The significant improvements in F1-score, precision, and recall reflect its superior capability in accurately reproducing textual elements and preserving reading order.
These results underscore SmolDocling's ability not only to match but often to surpass models many times its size, affirming that a compact model can be both efficient and effective when built with a focused architecture and optimized training strategies.
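The headline metric, edit distance (lower is better), is straightforward to reproduce. Below is a minimal, dependency-free sketch of a normalized edit distance between a predicted and a reference transcription; it illustrates the idea behind the metric, not the authors' exact evaluation script:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled by the reference length, so 0.0 is a perfect match."""
    return levenshtein(pred, ref) / max(len(ref), 1)

print(normalized_edit_distance("SmolDocling", "SmolDoclng"))  # 0.1
```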
To provide a practical glimpse into how SmolDocling operates, the following section includes a sample code snippet along with an illustration of the expected output. This example demonstrates how to convert a document image into the DocTags markup format.
!pip install docling_core
!pip install flash-attn  # optional: flash-attn requires a CUDA GPU; skip on CPU-only setups
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    # Use flash attention on GPU; fall back to eager attention on CPU
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Load images
image = load_image("https://user-images.githubusercontent.com/12294956/47312583-697cfe00-d65a-11e8-930a-e15fd67a5bb1.png")
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,  # keep special tokens: DocTags markup depends on them
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
from IPython.display import display, Markdown
display(Markdown(doc.export_to_markdown()))
This output illustrates how the various document elements (text blocks, tables, and code listings) are precisely marked with their content and spatial information, making them ready for further processing or analysis. The model could not convert everything into DocTags markup, however: as the output shows, it failed to read the handwritten text in the image.
!curl -L -o image2.png https://i.imgur.com/BFN038S.png
The input image this time is a receipt, and we now extract the text from it.
image = load_image("./image2.png")
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt1 = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs1 = processor(text=prompt1, images=[image], return_tensors="pt")
inputs1 = inputs1.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs1, max_new_tokens=8192)
prompt_length = inputs1.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# Export to any supported format
# HTML:
# doc.save_as_html(output_file)
# Markdown:
print(doc.export_to_markdown())
from IPython.display import display, Markdown
display(Markdown(doc.export_to_markdown()))
Quite impressive: the model extracted all of the content from the receipt, doing noticeably better than in the previous example.
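Beyond printing, the converted document can be persisted to disk. A minimal sketch, assuming the `save_as_html`, `save_as_markdown`, and `save_as_json` helpers available on `DoclingDocument` in recent `docling_core` releases (the filenames are placeholders):

```python
from pathlib import Path

# Persist the converted receipt in several formats.
doc.save_as_html(Path("receipt.html"))    # rendered HTML
doc.save_as_markdown(Path("receipt.md"))  # plain Markdown
doc.save_as_json(Path("receipt.json"))    # lossless DoclingDocument JSON
```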
SmolDocling sets a new benchmark in document conversion by proving that smaller, more efficient models can rival, and even surpass, the capabilities of their larger counterparts. Its innovative use of DocTags and an end-to-end conversion strategy provide a compelling blueprint for the next generation of vision-language models. It works well on receipts and performs acceptably on other documents, though not always perfectly, a trade-off that follows from its memory-saving design.
As the research community continues to refine techniques for element localization and multimodal understanding, SmolDocling provides a clear pathway toward more resource-efficient and versatile document processing solutions. With plans to release the accompanying datasets publicly, this work paves the way for further advancements and collaborations in the field.