Digital documents have long presented a dual challenge for both human readers and automated systems: preserving rich structural nuances while converting content into machine-processable formats. Traditional methods, whether relying on complex ensemble pipelines or massive foundational models, often struggle to balance accuracy with computational efficiency. SmolDocling emerges as a game-changing solution, offering an ultra-compact 256M-parameter vision-language model that performs end-to-end document conversion with remarkable precision and speed.
For decades, converting complex layouts, from business documents to academic papers, into structured representations has been a difficult task. Common issues include preserving reading order, recognizing tables, formulas, and code, and retaining each element's position on the page.
These challenges have spurred a great deal of research, yet solutions that are both efficient and accurate remain hard to come by.
SmolDocling addresses these hurdles head-on with a unified, end-to-end approach.
At its core, the model introduces a novel markup format known as DocTags: a universal standard that captures every element's content, structure, and spatial context.
DocTags changes how document elements are represented: each element is wrapped in an explicit tag that carries its type, its content, and its location on the page.
This clear, structured format minimizes ambiguity, a common issue with direct conversion to formats like HTML or Markdown, as the sketch below illustrates.
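To make this concrete, here is an illustrative sketch of what DocTags markup can look like. The tag names and the quantized `<loc_*>` location tokens mirror the examples published with the preview model, but treat the exact vocabulary as an assumption rather than a specification:

```xml
<doctag>
  <text><loc_58><loc_44><loc_426><loc_91>First paragraph of the page ...</text>
  <otsl><loc_58><loc_100><loc_426><loc_160><fcel>Item<fcel>Price<nl><fcel>Coffee<fcel>2.50<nl></otsl>
</doctag>
```

Each element carries its type (the tag), its bounding box (four location tokens), and its content; tables use compact OTSL cell tokens such as `<fcel>` (filled cell) and `<nl>` (end of row), which keeps structure and spatial context together in a single token stream.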
A key pillar of SmolDocling's success is its rich, diverse training data, which spans a wide range of document types and layouts.
SmolDocling builds upon the SmolVLM framework and incorporates several innovative techniques, such as aggressive visual-token compression, to ensure both efficiency and effectiveness.
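As a quick sanity check on the "ultra-compact" claim, you can count the released checkpoint's parameters directly. A minimal sketch, assuming `transformers` and `torch` are installed and using the same model id as the code later in this article:

```python
from transformers import AutoModelForVision2Seq

# Load the checkpoint and sum the parameter counts of all weight tensors.
model = AutoModelForVision2Seq.from_pretrained("ds4sd/SmolDocling-256M-preview")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expected to land in the ~256M range
```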
A thorough evaluation of SmolDocling against leading vision-language models highlights its competitive edge:
| Method | Model Size | Edit Distance ↓ | F1-score ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | METEOR ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5 VL [9] | 7B | 0.56 | 0.72 | 0.80 | 0.70 | 0.46 | 0.57 |
| GOT [89] | 580M | 0.61 | 0.69 | 0.71 | 0.73 | 0.48 | 0.59 |
| Nougat (base) [12] | 350M | 0.62 | 0.66 | 0.72 | 0.67 | 0.44 | 0.54 |
| SmolDocling (Ours) | 256M | 0.48 | 0.80 | 0.89 | 0.79 | 0.58 | 0.67 |
Insights: SmolDocling outperforms larger models across all key metrics in full-page transcription. The significant improvements in F1-score, precision, and recall reflect its superior capability in accurately reproducing textual elements and preserving reading order.
These results underscore SmolDocling's ability not only to match but often to surpass models many times its size, affirming that a compact model can be both efficient and effective when built with a focused architecture and optimized training strategies.
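The headline metric, edit distance (lower is better), is straightforward to reproduce. Below is a minimal, dependency-free sketch of a normalized edit distance between a predicted and a reference transcription; it illustrates the idea behind the metric, not the authors' exact evaluation script:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled by the reference length, so 0.0 is a perfect match."""
    return levenshtein(pred, ref) / max(len(ref), 1)

print(normalized_edit_distance("SmolDocling", "SmolDoclng"))  # 0.1
```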
To provide a practical glimpse into how SmolDocling operates, the following section includes a sample code snippet along with an illustration of the expected output. This example demonstrates how to convert a document image into the DocTags markup format.
!pip install docling_core
!pip install flash-attn  # optional: flash-attn requires a CUDA GPU; skip on CPU-only setups
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    # Use flash attention on GPU; fall back to eager attention on CPU
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Load images
image = load_image("https://user-images.githubusercontent.com/12294956/47312583-697cfe00-d65a-11e8-930a-e15fd67a5bb1.png")
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,  # keep special tokens: DocTags markup depends on them
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
from IPython.display import display, Markdown
display(Markdown(doc.export_to_markdown()))
This output illustrates how the various document elements (text blocks, tables, and code listings) are precisely marked with their content and spatial information, making them ready for further processing or analysis. The model could not convert everything into DocTags markup, however: as the output shows, it failed to read the handwritten text in the image.
!curl -L -o image2.png https://i.imgur.com/BFN038S.png
The input image this time is a receipt, and we now extract the text from it.
image = load_image("./image2.png")
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt1 = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs1 = processor(text=prompt1, images=[image], return_tensors="pt")
inputs1 = inputs1.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs1, max_new_tokens=8192)
prompt_length = inputs1.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# Export to any supported format
# HTML:
# doc.save_as_html(output_file)
# Markdown:
print(doc.export_to_markdown())
from IPython.display import display, Markdown
display(Markdown(doc.export_to_markdown()))
Quite impressive: the model extracted all of the content from the receipt, doing noticeably better than in the previous example.
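Beyond printing, the converted document can be persisted to disk. A minimal sketch, assuming the `save_as_html`, `save_as_markdown`, and `save_as_json` helpers available on `DoclingDocument` in recent `docling_core` releases (the filenames are placeholders):

```python
from pathlib import Path

# Persist the converted receipt in several formats.
doc.save_as_html(Path("receipt.html"))    # rendered HTML
doc.save_as_markdown(Path("receipt.md"))  # plain Markdown
doc.save_as_json(Path("receipt.json"))    # lossless DoclingDocument JSON
```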
SmolDocling sets a new benchmark in document conversion by proving that smaller, more efficient models can rival, and even surpass, the capabilities of their larger counterparts. Its innovative use of DocTags and an end-to-end conversion strategy provide a compelling blueprint for the next generation of vision-language models. It works well on receipts and performs acceptably on other documents, though not always perfectly, a trade-off that follows from its memory-saving design.
As the research community continues to refine techniques for element localization and multimodal understanding, SmolDocling provides a clear pathway toward more resource-efficient and versatile document processing solutions. With plans to release the accompanying datasets publicly, this work paves the way for further advancements and collaborations in the field.