Getting Started with Amazon SageMaker Ground Truth

Shikha Last Updated : 07 Jul, 2023


Introduction

In this era of Generative AI, data generation is at its peak. Building an accurate machine learning or AI model requires a high-quality dataset. Assuring the quality of that dataset is one of the most critical tasks, because poor data leads to inaccurate analytics and unreliable predictions that can affect an entire business and cause heavy financial losses.

https://www.forbes.com/sites/moorinsights/2021/12/03/amazon-sagemaker--the-easiest-way-to-build-artificial-intelligence-models-became-even-easier/ — Source: Forbes

Data labeling is the first step towards data quality assurance, because labels make the data understandable to AI models. Relying on humans alone is impractical, since people cannot keep up with the volume of data generated every day. In this article we look at Amazon SageMaker Ground Truth, a service for creating accurately labeled datasets.

This article was published as a part of the Data Science Blogathon.

Introduction
What is Amazon SageMaker Ground Truth?
Use cases of Amazon SageMaker Ground Truth
Automated Data Labeling via Ground Truth
Impact of Amazon SageMaker Ground Truth to Increase the Accuracy
- 1. Annotation Consolidation
- 2. Best Practices on Annotation Interface
Conclusion
Frequently Asked Questions

What is Amazon SageMaker Ground Truth?

Amazon SageMaker Ground Truth is a self-service data labeling offering that makes it easier to create efficient, highly accurate datasets. Ground Truth lets you use human annotators through third-party vendors, Amazon Mechanical Turk, or your own private workforce, and it provides a managed experience for setting up end-to-end labeling jobs.

https://www.edlitera.com/blog/posts/amazon-sagemaker-tutorial — Source: Edlitera.com

SageMaker Ground Truth can also generate millions of automatically labeled synthetic data samples, without manual data collection or labeling on our part. It offers labeling for various data types, including images, text, and videos, and supports tasks such as text classification, semantic segmentation, object detection, and image classification.


Use cases of Amazon SageMaker Ground Truth

Here are some industry use cases of SageMaker Ground Truth:

Autonomous Vehicles: Training models for autonomous vehicles requires a large amount of labeled data. SageMaker Ground Truth can be used to annotate objects such as cars, pedestrians, traffic signs, and road markings, which helps develop accurate perception models for safe autonomous driving.
Healthcare: Medical imaging datasets can be labeled with SageMaker Ground Truth to train models that help diagnose and identify diseases such as cancer, brain tumors, and other abnormalities. It can also be used to transcribe and annotate medical records for natural language processing (NLP) applications.
Manufacturing: Labeling images and sensor data in manufacturing processes can help in quality control, defect detection, predictive maintenance, and optimizing production efficiency.

The flexibility of SageMaker Ground Truth enables its application across multiple industries where labeled datasets are required for training and improving machine learning models.

Automated Data Labeling via Ground Truth

Amazon SageMaker Ground Truth applies machine learning to the labeling task itself: it uses Active Learning to label data automatically and accurately. Active learning is a machine learning technique that identifies the data a model cannot label confidently on its own and sends that data to humans for labeling. Let's walk through how Ground Truth works.

https://www.linkedin.com/pulse/efficient-accurate-data-labeling-amazon-sagemaker-milad-rezaeighale — Source: LinkedIn

Step 1: Data Storage

Collect the raw, unlabeled data from different sources and store it in an S3 bucket.
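Once the objects are in S3, a labeling job is pointed at them through an input manifest, a JSON Lines file in which each line is a JSON object that references one data object through a `source-ref` key. The snippet below is a minimal sketch of building such a manifest as text. The bucket name, the object keys, and the helper name `build_manifest` are hypothetical and chosen only for illustration.

```python
import json


def build_manifest(bucket: str, keys: list[str]) -> str:
    """Return JSON Lines text with one {"source-ref": ...} object per S3 key."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and object keys, for illustration only.
manifest = build_manifest("example-raw-data", ["images/cat1.jpg", "images/dog7.jpg"])
print(manifest, end="")
```

The resulting text would then be uploaded to S3 alongside the data so that the labeling job can read it.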

https://sagemaker-examples.readthedocs.io/en/latest/end_to_end/fraud_detection/1-data-prep-e2e.html — Source: Sagemaker

Step 2: Sending Data to Human

In this step, a random sample of the dataset is picked and sent to human workers for manual labeling.

https://www.marktechpost.com/2022/09/28/a-primer-on-data-labeling-approaches-to-building-real-world-machine-learning-applications/ — Source: Marktechpost.com

Step 3: Human Labeling

As soon as the workers receive the data chunk, they start labeling it.

https://medium.com/anolytics/what-is-data-annotation-and-what-are-its-advantages-95766213351e

Step 4: Label Consolidation Algorithm

Amazon SageMaker Ground Truth uses a label consolidation algorithm to reduce the risk of human error and improve the accuracy of the labeled dataset. The algorithm gathers all labels for each data point and then consolidates them into a single label according to the weight of each label.

https://www.geeksforgeeks.org/sagemaker-exploring-ground-truth-labeling-ml/

Step 5: Resultant Dataset

The consolidated result is stored as a small labeled dataset.

Step 6: Amazon SageMaker Model

Next, a machine learning model is created in the customer's account and trained on the small labeled dataset, so that it can label the rest of the unlabeled data on its own.

Step 7: Use the ML Model

In this step, the newly trained ML model is used to label the unlabeled data points of the original dataset.

Step 8: Automated Labeling

Automated labeling is applied to the remaining dataset with the help of the active learning method.

Step 9: High Confidence

Here the confidence score of the model is checked, and the automated annotation is applied only if the score is high.

Step 10: Low Confidence

If the confidence score is low, the automated annotation is not applied, and that portion of the data is sent to humans for labeling. The resulting human labels then form a new training set that the model uses to improve its accuracy.

These steps repeat in a cycle until the entire dataset is labeled.
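The high-confidence and low-confidence branches in Steps 9 and 10 can be sketched as a simple routing function. This is only an illustration of the idea, not Ground Truth's internal logic: the function name `route_by_confidence`, the tuple layout, and the threshold of 0.9 are all assumptions made for the example.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split model predictions into auto-labeled items and items for human review.

    predictions: iterable of (item_id, label, confidence) tuples.
    threshold: assumed cut-off; items at or above it keep the model's label.
    """
    auto_labeled = {}
    needs_human = []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled[item_id] = label      # Step 9: accept the model's label
        else:
            needs_human.append(item_id)        # Step 10: send back to human workers
    return auto_labeled, needs_human


# Made-up predictions for three images.
preds = [("img-1", "car", 0.97), ("img-2", "pedestrian", 0.55), ("img-3", "sign", 0.92)]
auto, human = route_by_confidence(preds, threshold=0.9)
print(auto)   # {'img-1': 'car', 'img-3': 'sign'}
print(human)  # ['img-2']
```

In a full loop, the items in `human` would be labeled by workers, added to the training set, and the model retrained before the next round.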

Impact of Amazon SageMaker Ground Truth to Increase the Accuracy

SageMaker offers two main ways to improve the accuracy of training data:

1. Annotation Consolidation

The purpose of annotation consolidation is to counteract the error or bias of individual workers by sending each data object to two or more workers and then consolidating their responses into a single label for that object.

https://aws.amazon.com/blogs/machine-learning/annotate-data-for-less-with-amazon-sagemaker-ground-truth-and-automated-data-labeling/ — Source: Amazon

After collecting the annotations from the various workers, Ground Truth applies the consolidation algorithm to compare them.

Algorithm

Outlier annotations are detected and disregarded.
A weighted consolidation of the annotations is applied, assigning higher weights to more reliable annotations.
The label assigned to each object in the dataset is a probabilistic estimate of the true label. The object may have multiple annotations, but the output is a single label for each object.
We can choose the number of workers who annotate each object. More workers increase the accuracy of the labels, but they also increase the labeling cost.
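To make the idea of a weighted, probabilistic consolidation concrete, here is a minimal sketch of one way it could be done. It is not Ground Truth's actual algorithm. It assumes that each worker has a known accuracy strictly between 0 and 1, that a wrong answer is spread evenly over the other classes, that all classes are equally likely beforehand, and that there are at least two classes. The names `consolidate`, `votes`, and `accuracy` are hypothetical.

```python
import math


def consolidate(votes, accuracy, classes):
    """Return (best_label, posterior) from worker votes.

    votes: dict of worker_id -> label chosen by that worker.
    accuracy: dict of worker_id -> assumed probability that the worker is right.
    classes: list of all possible labels (at least two).
    """
    k = len(classes)
    log_post = {}
    for c in classes:
        total = 0.0  # uniform prior, so it cancels out when normalising
        for worker, label in votes.items():
            a = accuracy[worker]
            # Likelihood of this vote if the true class were c.
            total += math.log(a if label == c else (1.0 - a) / (k - 1))
        log_post[c] = total
    # Normalise in a numerically stable way.
    peak = max(log_post.values())
    unnorm = {c: math.exp(v - peak) for c, v in log_post.items()}
    z = sum(unnorm.values())
    posterior = {c: v / z for c, v in unnorm.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# One reliable worker disagrees with two less reliable ones (made-up numbers).
votes = {"w1": "cat", "w2": "dog", "w3": "dog"}
accuracy = {"w1": 0.95, "w2": 0.6, "w3": 0.6}
label, posterior = consolidate(votes, accuracy, ["cat", "dog"])
print(label, round(posterior[label], 3))
```

With these made-up numbers the single reliable worker outweighs the two less reliable ones, which is the effect that weighting by reliability is meant to achieve. A plain majority vote would have picked the other label.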

The annotation consolidation function offered by Ground Truth applies to all predefined labeling tasks, including named entity recognition (NER), bounding boxes, semantic segmentation, and image and text classification. Let's look at each one.

Named Entity Recognition (NER): Text selections are clustered by Jaccard similarity. Selection boundaries are calculated from the mode of the annotations, or from the median if the mode is unclear. The label resolves to the most assigned entity label in the cluster, with ties broken by random selection.
Bounding Box Annotation: Bounding boxes from multiple workers are compared, the most similar boxes are found using the Jaccard index (intersection over union), and those boxes are averaged.
Multi-class Annotation Consolidation for Image and Text Classification: The true class is estimated from the class annotations of the individual workers using Bayesian inference.
Semantic Segmentation Annotation: The system considers each pixel of an image as a multi-class object and treats the pixel annotations from workers as “votes.” Additionally, it incorporates extra information from surrounding pixels by applying a smoothing function to the image.
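The Jaccard index mentioned for bounding boxes is the area of overlap divided by the area of the union. The sketch below computes it and then uses it in a simplified consolidation: pick the box that agrees best with the others, keep the boxes that overlap it enough, and average their coordinates. The cut-off of 0.5 and the function names are assumptions for illustration, and the exact procedure Ground Truth follows may differ.

```python
def iou(box_a, box_b):
    """Jaccard index of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def consolidate_boxes(boxes, min_iou=0.5):
    """Average the boxes that agree with the most 'central' box.

    Assumes at least two boxes. min_iou is an assumed similarity cut-off.
    """
    def mean_iou(i):
        others = [iou(boxes[i], b) for j, b in enumerate(boxes) if j != i]
        return sum(others) / len(others)

    # Reference box: the one that overlaps the others most on average.
    ref = boxes[max(range(len(boxes)), key=mean_iou)]
    kept = [b for b in boxes if iou(ref, b) >= min_iou]
    return tuple(sum(b[i] for b in kept) / len(kept) for i in range(4))


# Two workers roughly agree; the third drew a box somewhere else entirely.
worker_boxes = [(10, 10, 50, 50), (12, 11, 52, 49), (100, 100, 140, 140)]
print(consolidate_boxes(worker_boxes))  # (11.0, 10.5, 51.0, 49.5)
```

The third box has no overlap with the other two, so it is dropped as an outlier and the final box is the coordinate-wise mean of the first two.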

2. Best Practices on Annotation Interface

The annotation interface has various features that improve the quality of human labeling. A well-organized interface helps workers produce an accurate dataset with minimal errors. Best practices include displaying brief instructions on a fixed side panel, along with examples of good and bad labels. For bounding box annotations, the interface can also darken the rest of the image so that the area within the box stands out.

Conclusion

We discussed how Amazon SageMaker Ground Truth helps generate high-quality datasets for machine learning models. The key takeaways of this article are:

Data labeling is the first step towards data quality assurance that makes it understandable for AI models.
Ground Truth can generate millions of automatically labeled synthetic data samples without manual data collection or labeling on our part.
Annotation consolidation and annotation interface best practices are two ways SageMaker can enhance training data accuracy.

Frequently Asked Questions

Q1. What is Amazon SageMaker Ground Truth?

A. It is a fully managed data labeling service that efficiently creates high-quality labeled datasets for training models. It combines automated labeling through machine learning with human review to deliver highly accurate annotations.

Q2. How does SageMaker Ground Truth work?

A. SageMaker Ground Truth uses a combination of automated and manual annotation techniques. It provides a web-based interface for human reviewers to annotate data based on predefined labeling tasks. The service also incorporates options for active learning, where it trains models on labeled data to propose labels for the remaining unlabeled data, thereby enhancing annotation efficiency.

Q3. Which types of data can SageMaker Ground Truth annotate?

A. SageMaker Ground Truth supports various data types, including images, text, audio, and video. It provides annotation tools for each data type, enabling accurate labeling for different use cases.

Q4. Can SageMaker Ground Truth integrate with other AWS services?

A. Yes, SageMaker Ground Truth integrates with other AWS services. For example, you can use Amazon S3 for storing data, Amazon Mechanical Turk for sourcing human reviewers, and Amazon Rekognition for automated image and video analysis.

Q5. How does SageMaker Ground Truth ensure the quality of labeled data?

A. SageMaker Ground Truth employs multiple mechanisms to ensure high-quality annotations. It includes features like review workflows, built-in annotation consolidation, and active learning to minimize errors and improve the accuracy of labeled datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Shikha

I am a tech enthusiast, a student, and a learner. I am a critical reader and a lover of words who finds writing blogs interesting. I possess the capability to research and learn new technologies quickly.
