Detecting Table Rows and Columns in Images Using Transformers

Mobarak Inuwa | Last Updated: 25 Aug, 2023

Introduction

Have you ever worked with unstructured data and wished for a way to detect tables in your documents so you could process them more quickly? In this article, we will look at not only detecting the presence of tables but also recognizing their structure in images using transformers. This is made possible by two distinct models: one detects tables in documents, and the second recognizes the structure of a table, that is, its individual rows and columns.

Learning Objectives

  • How to detect table rows and columns in images
  • A look at the Table Transformer and the Detection Transformer (DETR)
  • About the PubTables-1M dataset
  • How to perform inference with the Table Transformer

Documents, articles, and PDF files are valuable sources of information, often containing tables that convey critical data. Extracting information from these tables efficiently can be difficult because of differences in formatting and representation across documents, and copying or recreating these tables manually is time-consuming and error-prone. Table Transformers trained on the PubTables-1M dataset address these problems across table detection, structure recognition, and functional analysis.

This article was published as a part of the Data Science Blogathon.

How was This Done?

This is made possible by a transformer model known as the Table Transformer. It uses a novel approach to detecting tables in documents and images, such as articles, and was trained on a large annotated dataset named PubTables-1M. This dataset contains nearly one million annotated tables, which helped the model reach state-of-the-art performance. That performance was achieved by addressing the challenges of imperfect annotations, spatial alignment issues, and table structure consistency. The research paper published with the model leveraged the Detection Transformer (DETR) model for joint modeling of table structure recognition (TSR) and functional analysis (FA). In other words, DETR is the backbone on which the Table Transformer, developed by Microsoft Research, is built. Let us look at DETR a bit more.

DEtection TRansformer (DETR)

As mentioned earlier, DETR is short for DEtection TRansformer. It consists of a convolutional backbone, such as a ResNet, followed by an encoder-decoder Transformer, which gives it the ability to carry out object detection tasks. Unlike models such as Faster R-CNN and Mask R-CNN, DETR does not rely on hand-designed components like region proposals, non-maximum suppression, and anchor generation. It can be trained end to end, facilitated by its loss function, known as the bipartite matching loss. The Table Transformer work applied this architecture in experiments on PubTables-1M and demonstrated the significance of canonical data in improving performance.
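To make the bipartite matching idea concrete, here is a minimal sketch (not DETR's actual training code) that matches a set of predicted boxes to ground-truth boxes with the Hungarian algorithm from SciPy. It uses a plain L1 cost over box coordinates as a stand-in for DETR's combined class-and-box matching cost:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy predicted and ground-truth boxes in (xmin, ymin, xmax, ymax) format
pred_boxes = np.array([[0, 0, 10, 10], [50, 50, 80, 80], [5, 5, 12, 12]], dtype=float)
gt_boxes = np.array([[48, 52, 79, 81], [1, 1, 11, 11]], dtype=float)

# Cost matrix: L1 distance between every prediction and every ground-truth box
# (DETR's real matching cost also includes class probabilities and generalized IoU)
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

# Hungarian algorithm: a one-to-one assignment with minimal total cost
pred_idx, gt_idx = linear_sum_assignment(cost)
for p, g in zip(pred_idx, gt_idx):
    print(f"prediction {p} matched to ground truth {g} (cost {cost[p, g]:.1f})")

Each ground-truth box ends up paired with exactly one prediction; unmatched predictions are treated as "no object" when the loss is computed.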

The PubTables-1M Dataset

PubTables-1M is a major contribution to the field of table extraction. It was built from a collection of tables sourced from scientific articles and includes detailed header and location information that supports a range of table modeling strategies. A notable feature of PubTables-1M is its focus on addressing ground truth inconsistencies stemming from over-segmentation, which improves the accuracy of annotations.

Figure: The PubTables-1M dataset (Source: Smock et al., 2021)

Training the Table Transformer on PubTables-1M showcased the effectiveness of the dataset. As noted earlier, transformer-based object detection, particularly with the DETR model, exhibits strong performance across table detection, structure recognition, and functional analysis tasks. The results highlight the effectiveness of canonical data in improving model accuracy and reliability.

Canonicalization of the PubTables-1M Dataset

A crucial aspect of PubTables-1M is its innovative canonicalization process. This tackles over-segmentation in ground truth annotations, which can lead to ambiguity. By making assumptions about a table’s structure, the canonicalization algorithm corrects annotations, aligning them with the table’s logical organization. This enhances the reliability of the dataset and improves downstream model performance.
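As a purely conceptual illustration (this is not the paper's actual algorithm), over-segmentation often splits one logical cell into several adjacent boxes; canonicalization merges such fragments back into a single, consistent annotation:

# Conceptual illustration only: merge two adjacent boxes that over-segmentation
# split out of one logical header cell (coordinates are made up)
def merge_boxes(box_a, box_b):
    """Return the smallest (xmin, ymin, xmax, ymax) box covering both inputs."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

left_fragment = (10, 5, 60, 25)
right_fragment = (60, 5, 110, 25)
print(merge_boxes(left_fragment, right_fragment))  # (10, 5, 110, 25)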

Implementing Inference with the Table Transformer

We will now run inference with the Table Transformer. First, we install the transformers library from the Hugging Face GitHub repository. You can find the complete code for this article at https://github.com/inuwamobarak/detecting-tables-in-documents

!pip install -q git+https://github.com/huggingface/transformers.git

Next, we install ‘timm’, a popular library of PyTorch image models, backbones, and training utilities, which the Table Transformer implementation relies on for its convolutional backbone.

# Install the 'timm' library using pip
!pip install -q timm

Next, we load an image on which we want to run inference. I have added a custom dataset to my Hugging Face repo. You can use it or adjust the code to point to your own data. I have provided a link to the GitHub repo for this code, along with the other original links, below.

# Import the necessary libraries
from huggingface_hub import hf_hub_download
from PIL import Image

# Download a file from the specified Hugging Face repository and location
file_path = hf_hub_download(repo_id="inuwamobarak/random-files", repo_type="dataset", filename="Screenshot from 2023-08-16 22-30-54.png")

# Open the downloaded image using the PIL library and convert it to RGB format
image = Image.open(file_path).convert("RGB")

# Get the original width and height of the image
width, height = image.size

# Resize the image to 50% of its original dimensions
resized_image = image.resize((int(width * 0.5), int(height * 0.5)))
Sample document image containing a table

So, we will be detecting the table in the image above and recognizing the rows and columns.

Let us do some basic preprocessing tasks.

# Import the DetrFeatureExtractor class from the Transformers library
from transformers import DetrFeatureExtractor

# Create an instance of the DetrFeatureExtractor
feature_extractor = DetrFeatureExtractor()

# Use the feature extractor to encode the image
# 'image' should be the PIL image object that was obtained earlier
encoding = feature_extractor(image, return_tensors="pt")

# Get the keys of the encoding dictionary
keys = encoding.keys()
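If you want to see what the feature extractor produced, you can inspect the keys and tensor shapes; for DETR-style extractors the encoding typically contains pixel_values (and a pixel_mask when padding is applied):

# Inspect what the feature extractor produced
print(keys)  # typically dict_keys(['pixel_values', 'pixel_mask'])
print(encoding["pixel_values"].shape)  # (batch_size, 3, height, width) after resizing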

We will now load the Table Transformer detection model from Microsoft on Hugging Face.

# Import the TableTransformerForObjectDetection class from the transformers library
from transformers import TableTransformerForObjectDetection

# Load the pre-trained Table Transformer model for object detection
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
import torch

# Disable gradient computation for inference
with torch.no_grad():
    # Pass the encoded image through the model for inference
    # 'model' is the TableTransformerForObjectDetection model loaded previously
    # 'encoding' contains the encoded image features obtained using the DetrFeatureExtractor
    outputs = model(**encoding)
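The raw outputs contain class logits and normalized box predictions for a fixed set of object queries; a quick shape check makes this concrete:

# Raw DETR-style outputs: one prediction per object query
print(outputs.logits.shape)      # (batch_size, num_queries, num_labels + 1)
print(outputs.pred_boxes.shape)  # (batch_size, num_queries, 4), normalized (cx, cy, w, h)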

Now we can plot the result.

import matplotlib.pyplot as plt

# Define colors for visualization
COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],
          [0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]

def plot_results(pil_img, scores, labels, boxes):
    # Create a figure for visualization
    plt.figure(figsize=(16, 10))
    
    # Display the PIL image
    plt.imshow(pil_img)
    
    # Get the current axis
    ax = plt.gca()
    
    # Repeat the COLORS list multiple times for visualization
    colors = COLORS * 100
    
    # Iterate through scores, labels, boxes, and colors for visualization
    for score, label, (xmin, ymin, xmax, ymax), c in zip(scores.tolist(), labels.tolist(), boxes.tolist(), colors):
        # Add a rectangle to the image for the detected object's bounding box
        ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                   fill=False, color=c, linewidth=3))
        
        # Prepare the text for the label and score
        text = f'{model.config.id2label[label]}: {score:0.2f}'
        
        # Add the label and score text to the image
        ax.text(xmin, ymin, text, fontsize=15,
                bbox=dict(facecolor='yellow', alpha=0.5))
    
    # Turn off the axis
    plt.axis('off')
    
    # Display the visualization
    plt.show()
# Get the original width and height of the image
width, height = image.size

# Post-process the object detection outputs using the feature extractor
results = feature_extractor.post_process_object_detection(outputs, threshold=0.7, target_sizes=[(height, width)])[0]

# Plot the visualization of the results
plot_results(image, results['scores'], results['labels'], results['boxes'])
Detected table

So, we have successfully detected the tables but not recognized the rows and columns. Let us do that now. We will load another image for this purpose.
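As an aside, in an end-to-end pipeline you would usually crop each detected table out of the original page and pass the crop to the structure-recognition model rather than loading a separate image. A minimal sketch, assuming the image and results variables from the detection step above (the 10-pixel margin is an arbitrary choice):

# Crop each detected table region (with a small margin) from the page image
padding = 10  # arbitrary margin in pixels
table_crops = []
for xmin, ymin, xmax, ymax in results['boxes'].tolist():
    crop = image.crop((int(max(xmin - padding, 0)), int(max(ymin - padding, 0)),
                       int(min(xmax + padding, image.width)), int(min(ymax + padding, image.height))))
    table_crops.append(crop)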

# Import the necessary libraries
from huggingface_hub import hf_hub_download
from PIL import Image

# Download the image file from the specified Hugging Face repository and location
# Use either of the provided 'repo_id' lines depending on your use case
file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_table.png")
# file_path = hf_hub_download(repo_id="inuwamobarak/random-files", repo_type="dataset", filename="Screenshot from 2023-08-16 22-40-10.png")

# Open the downloaded image using the PIL library and convert it to RGB format
image = Image.open(file_path).convert("RGB")

# Get the original width and height of the image
width, height = image.size

# Resize the image to 90% of its original dimensions
resized_image = image.resize((int(width * 0.9), int(height * 0.9)))
Sample table for recognition

Now, let us prepare this image just as before.

# Use the feature extractor to encode the image
encoding = feature_extractor(image, return_tensors="pt")

# Get the keys of the encoding dictionary
keys = encoding.keys()

Next, we load the Table Transformer again, this time using the structure-recognition checkpoint.

# Import the TableTransformerForObjectDetection class from the transformers library
from transformers import TableTransformerForObjectDetection

# Load the pre-trained Table Transformer model for table structure recognition
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")

with torch.no_grad():
  outputs = model(**encoding)
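Before post-processing, it can help to check which structure classes this checkpoint predicts; they typically include labels such as 'table row' and 'table column':

# List the structure classes the model can predict
print(model.config.id2label)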

Now we can visualize our results.

# Create a list of target sizes for post-processing
# 'image.size[::-1]' swaps the width and height to match the target size format (height, width)
target_sizes = [image.size[::-1]]

# Post-process the object detection outputs using the feature extractor
# Use a threshold of 0.6 for confidence
results = feature_extractor.post_process_object_detection(outputs, threshold=0.6, target_sizes=target_sizes)[0]

# Plot the visualization of the results
plot_results(image, results['scores'], results['labels'], results['boxes'])
Recognised rows and columns
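If you want to go one step further and derive individual cell boxes, a common approach is to intersect every detected row box with every detected column box. A minimal sketch, assuming the results dictionary above and that the checkpoint's labels include 'table row' and 'table column' (verify with model.config.id2label as shown earlier):

# Separate detected rows and columns by label name
id2label = model.config.id2label
boxes = results['boxes'].tolist()
labels = results['labels'].tolist()
rows = [box for box, label in zip(boxes, labels) if id2label[label] == 'table row']
cols = [box for box, label in zip(boxes, labels) if id2label[label] == 'table column']

# Each cell is the intersection of one row box and one column box
cells = []
for rxmin, rymin, rxmax, rymax in rows:
    for cxmin, cymin, cxmax, cymax in cols:
        cells.append((max(rxmin, cxmin), max(rymin, cymin),
                      min(rxmax, cxmax), min(rymax, cymax)))

print(f"{len(rows)} rows x {len(cols)} columns -> {len(cells)} candidate cells")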

There we have it. Try it on your own tables and see how it goes. Please follow me on GitHub and my socials for more tutorials on Transformers. Also, leave a comment below if you find this helpful.

Conclusion

The possibilities for uncovering insights from unstructured information are brighter than ever before. One major advance in table detection is the introduction of the PubTables-1M dataset and the concept of canonicalization. We have seen how table extraction works and the innovations that have reshaped the field. Canonicalization, in particular, ensures consistent ground truth annotations by addressing over-segmentation; aligning annotations with the logical structure of tables has elevated the dataset’s reliability and accuracy, paving the way for robust model performance.

Key Takeaways

  • The PubTables-1M dataset revolutionizes table extraction by providing an array of annotated tables from scientific articles.
  • The innovative concept of canonicalization tackles the challenge of ground truth inconsistency.
  • Transformer-based object detection models, particularly the Detection Transformer (DETR), excel in table detection, structure recognition, and functional analysis tasks.

Frequently Asked Questions

Q1: What is object detection using DETR?

A1: The Detection Transformer is a set-based object detector that places a Transformer on top of a convolutional backbone. A conventional CNN learns a 2D representation of the input image; the model then flattens this representation, supplements it with a positional encoding, and passes it into a Transformer encoder.

Q2: What is the role of the CNN backbone in Detr?

A2: The CNN backbone processes the input image and extracts high-level features crucial for recognizing objects. These features are then fed into the Transformer encoder for further analysis.

Q3: What’s unique about Detr’s approach?

A3: DETR replaces the traditional region proposal network (RPN) with a set-based approach. It treats object detection as a direct set prediction problem, using bipartite matching to pair predictions with ground-truth objects, which lets it handle varying numbers of objects efficiently without anchor boxes.

Q4: Which is better, Yolo or DETR, for object detection?

A4: It depends on the variant and the use case. The Real-Time Detection Transformer (RT-DETR), for example, is a real-time end-to-end object detector that leverages IoU-aware query selection to address inference speed issues and outperforms comparable YOLO detectors in both accuracy and speed.

Q5: What is a transformer in object detection?

A5: The DEtection TRansformer (DETR) brings transformers to object detection by reframing detection as a set prediction problem, eliminating the need for proposal generation and hand-crafted post-processing steps.

References

  • GitHub repo: https://github.com/inuwamobarak/detecting-tables-in-documents
  • Smock, B., Pesala, R., & Abraham, R. (2021). PubTables-1M: Towards comprehensive table extraction from unstructured documents. arXiv. https://arxiv.org/abs/2110.00061
  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv. https://arxiv.org/abs/2005.12872
  • https://huggingface.co/docs/transformers/model_doc/detr
  • https://huggingface.co/docs/transformers/model_doc/table-transformer
  • https://huggingface.co/microsoft/table-transformer-detection
  • https://huggingface.co/microsoft/table-transformer-structure-recognition

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I am an AI Engineer with a deep passion for research and solving complex problems. I provide AI solutions leveraging Large Language Models (LLMs), GenAI, Transformer Models, and Stable Diffusion.

