Machine Learning: Adversarial Attacks and Defense

Guest Blog Last Updated : 01 Sep, 2022
8 min read

Introduction

Adversarial machine learning is a growing threat in the AI and machine learning research community. The most common goal is to cause a malfunction in a machine learning model: an adversarial attack might present a model with inaccurate or misrepresentative data during training, or introduce maliciously designed inputs to deceive an already trained model.

Adversarial Attacks

Before diving deeper, adversarial attacks can be thought of as a particularly acute form of dataset anomaly, crafted maliciously from the outset to affect a machine learning model. Most machine learning techniques are designed for specific problem sets under the assumption that the training and test data come from the same statistical distribution. Attackers can deliberately exploit this assumption to disrupt your MLOps pipeline.

Attackers use these techniques to manipulate your model’s performance, which in turn affects your product and reputation. So let us dive deeper into these attacks and how they can be dealt with.

Adversarial Attacks on AI/ML

Guarding against adversarial attacks on machine learning requires augmentations and additions to the model pipeline, especially when the model plays a vital role and the margin for error is very narrow. For example, an adversarial attack could involve feeding a model false or misleading data during training, or adding maliciously prepared data to trick an already trained model.

To get an idea of what adversarial examples look like, consider this demonstration: starting with an image of a panda, the attacker adds a small perturbation that has been calculated to make the model recognize the image as a gibbon with high confidence.

 

[Image: Adversarial Attacks on AI/ML – the panda-to-gibbon adversarial example]

 

To break this down, the perturbation takes into account how the model’s feature extractor filters the image, and it nudges exactly those pixel values that matter for the prediction, so that the image ends up completely misclassified.
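To make this concrete, here is a minimal sketch (scikit-learn’s 8x8 digits and a linear classifier, not the original panda/gibbon ImageNet demonstration): a small nudge to the pixels the classifier weights most heavily is often enough to flip its prediction, even though each pixel changes only slightly relative to the 0–16 range.

```python
# Minimal sketch: a small, targeted pixel perturbation flips a linear
# classifier's prediction. Assumes scikit-learn is installed; this is an
# illustration, not the panda/gibbon ImageNet example.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)          # 8x8 grayscale digits, pixel values 0..16
clf = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0]                                      # a correctly classified sample
true_class = int(clf.predict([x])[0])
target_class = (true_class + 1) % 10          # any other class will do

# For a linear model, moving along (w_target - w_true) raises the target-class
# score fastest; the sign of that direction is our "perturbation".
direction = clf.coef_[target_class] - clf.coef_[true_class]
epsilon = 2.0                                 # small relative to the 0..16 pixel range
x_adv = np.clip(x + epsilon * np.sign(direction), 0, 16)

print("original prediction:   ", true_class)
print("adversarial prediction:", clf.predict([x_adv])[0])
print("max pixel change:      ", np.max(np.abs(x_adv - x)))
```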

More examples of how these attacks can wreck a pipeline come from the automotive industry, where something as simple as misleading stickers on the road can throw off an autonomous vehicle and confuse its decision-making module, with potentially horrible outcomes. Or consider something we encounter daily in India: stickers and posters plastered over vital traffic signs.

Types of Adversarial Attacks

These attacks can be divided into two broad categories, a distinction that helps with the initial analysis of the problem at hand and can greatly help your engineering team. Black box and white box attacks are the first-level classification of adversarial attacks on an AI system.

In black box attacks, the attacker does not have access to the model’s parameters, so they use a different model, or no model at all, to generate adversarial images in the hope that these will transfer to the target model. In white box attacks, by contrast, the attacker does have access to the model’s parameters.

Before diving deeper into the specific black box and white box attacks, here are some attack types widely encountered in the field:

Data Poisoning


Poisoning is the contamination of the training dataset. Since the dataset shapes what the learning algorithm learns, poisoning effectively has the potential to reprogram the algorithm. Serious concerns have been raised, particularly about user-generated training data, such as for content recommendation or natural language models, given the prevalence of fake accounts.
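As a rough illustration of how poisoning degrades a model, here is a minimal sketch on synthetic data (assuming scikit-learn; the flip rate and dataset are arbitrary choices): flipping a chunk of training labels for one class typically lowers accuracy on clean test data.

```python
# Minimal label-flipping poisoning sketch on synthetic data (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: train on clean labels.
clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# Poison the training set: flip half of the class-1 labels to 0, a crude targeted attack.
rng = np.random.default_rng(0)
poisoned = y_train.copy()
ones = np.flatnonzero(poisoned == 1)
poisoned[rng.choice(ones, size=len(ones) // 2, replace=False)] = 0

poisoned_acc = LogisticRegression(max_iter=1000).fit(X_train, poisoned).score(X_test, y_test)

print(f"test accuracy, clean training labels:    {clean_acc:.3f}")
print(f"test accuracy, poisoned training labels: {poisoned_acc:.3f}")
```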

Byzantine Attacks

In the age of edge computing, more and more models are trained on multiple devices simultaneously, with the devices continuously collaborating with a central server, as in federated learning. However, if some of these devices misbehave and send corrupted updates, they can damage the model at its core right in the training phase. Such attacks can also occur when only one device is used, which then becomes a vulnerable single point of failure.
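A toy illustration of the risk (pure NumPy with made-up client updates, not a real federated learning stack): if the central server naively averages client updates, a single Byzantine client sending a huge bogus update can drag the aggregate arbitrarily far off, whereas a coordinate-wise median stays close to the honest consensus.

```python
# Toy Byzantine-client sketch: mean vs. coordinate-wise median aggregation.
# The client updates are synthetic; this is not a real federated learning system.
import numpy as np

rng = np.random.default_rng(42)

# Nine honest clients send similar parameter updates (centered around 1.0).
honest_updates = rng.normal(loc=1.0, scale=0.1, size=(9, 5))

# One Byzantine client sends an arbitrarily large, malicious update.
byzantine_update = np.full((1, 5), 1000.0)

all_updates = np.vstack([honest_updates, byzantine_update])

print("mean aggregation:  ", np.mean(all_updates, axis=0))    # dragged far from 1.0
print("median aggregation:", np.median(all_updates, axis=0))  # stays close to 1.0
```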

Evasion

Evasion attacks exploit flaws in an already trained model. Spammers and hackers frequently try to avoid detection by obscuring the content of spam emails and malware: samples are altered so that they evade detection and are classified as legitimate. Image-based spam is a prime example of evasion, where the spam content is embedded within an attached image to avoid textual examination by anti-spam filters. Spoofing attacks on biometric verification systems are another type of evasion.
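As a toy example (the keyword filter and messages below are hypothetical, not any real anti-spam product), an attacker can evade a naive text filter simply by obfuscating the trigger words; image-based spam goes further and defeats such a filter entirely, since there is no text left to inspect.

```python
# Toy evasion sketch: a naive keyword-based spam filter and two ways around it.
# The filter and the messages are hypothetical illustrations.
SPAM_KEYWORDS = {"free", "winner", "prize", "viagra"}

def is_spam(message: str) -> bool:
    words = message.lower().split()
    return any(word in SPAM_KEYWORDS for word in words)

print(is_spam("You are a WINNER claim your FREE prize"))   # True  - caught
print(is_spam("You are a w1nner claim your fr-ee pr1ze"))  # False - obfuscation evades the filter

# Image-based spam goes further: the text lives inside an attached image,
# so a purely textual filter sees nothing suspicious at all.
```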

Model Extraction

In a model extraction attack, an adversary probes a black box machine learning system to reconstruct the model or recover the data it was trained on. When the training data or the model itself is sensitive and confidential, this can cause serious problems. Model extraction might, for example, be used to copy a proprietary stock trading model that the adversary could then employ to their own financial advantage. In the worst case, model extraction amounts to model stealing: extracting enough information to allow a complete reconstruction of the model.
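A minimal sketch of the idea (scikit-learn on synthetic data; the victim model, query inputs, and query budget are made-up stand-ins, and real extraction attacks are far more query-efficient): the attacker only calls the victim’s prediction API, labels their own synthetic inputs with its answers, and fits a surrogate that often agrees with the victim on most inputs.

```python
# Minimal model-extraction sketch: train a surrogate purely from the victim's
# predictions. Victim, query data, and query budget are synthetic assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=10, random_state=1)
victim = RandomForestClassifier(random_state=1).fit(X[:2000], y[:2000])   # the "secret" model

# The attacker never sees the training data -- only the prediction API.
rng = np.random.default_rng(1)
queries = rng.normal(size=(5000, 10))            # attacker-chosen query inputs
stolen_labels = victim.predict(queries)          # answers from the black box

surrogate = LogisticRegression(max_iter=1000).fit(queries, stolen_labels)

# How often does the surrogate agree with the victim on fresh data?
agreement = np.mean(surrogate.predict(X[2000:]) == victim.predict(X[2000:]))
print(f"surrogate/victim agreement: {agreement:.2%}")
```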

Diving Deeper into Black Box and White Box Attacks

At their core, adversarial attacks are malicious modifications of the data that may look fine to the human eye but cause misclassification in a machine learning pipeline. These attacks often take the form of specially designed “noise” that elicits the misclassification.

Let us look at the two major types of attacks that come under Adversarial attacks:

Black Box Attacks

In adversarial machine learning, black box attacks assume that the adversary can only obtain outputs for given inputs and does not know the model’s structure or parameters. In this case the adversarial example is constructed either with a surrogate model built from scratch or with no model at all (having nothing beyond the ability to query the original model). In either case, the aim is to generate adversarial examples that transfer to the black box model under attack.

Let us look at some attacks in black box attacks:

Square Attack

This attack is based on a random search that selects localized square-shaped updates at random positions, so that at each iteration the adversarial change to the image sits approximately at the boundary of the allowed perturbation budget.

To increase query efficiency, the technique perturbs only a small square block of pixels at each step, hence the name Square Attack, and it terminates as soon as an adversarial sample is found. Because the algorithm relies on predicted scores rather than gradient information, the authors claim that the approach is unaffected by gradient masking, a technique previously used to defend against evasion attacks.

According to the paper’s authors, the proposed Square Attack required fewer queries than the state-of-the-art score-based black box attacks of the time. The result is an adversarial example that the model assigns to the wrong class with high confidence, yet still looks like the original image.
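Below is a heavily simplified score-based random-search sketch in the spirit of the Square Attack; the real algorithm schedules the square size and keeps the perturbation on the boundary of the ε-ball, while this version just accepts any random square change that lowers the true-class probability. The victim is a small stand-in: the scikit-learn digits classifier used earlier.

```python
# Simplified score-based random search in the spirit of the Square Attack.
# Victim = logistic regression on scikit-learn's 8x8 digits. The real attack
# schedules the square size and keeps the perturbation on the epsilon-ball
# boundary, which is skipped here for brevity.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
victim = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0].copy()
true_class = int(victim.predict([x])[0])
eps, side, rng = 4.0, 3, np.random.default_rng(0)   # perturbation budget, square side

x_adv = x.copy()
best_score = victim.predict_proba([x_adv])[0, true_class]

for query in range(500):
    # Propose: overwrite one random 3x3 square with x +/- eps (clipped to [0, 16]).
    cand = x_adv.copy().reshape(8, 8)
    r, c = rng.integers(0, 8 - side + 1, size=2)
    cand[r:r + side, c:c + side] = np.clip(
        x.reshape(8, 8)[r:r + side, c:c + side] + eps * rng.choice([-1.0, 1.0]), 0, 16)
    cand = cand.reshape(-1)

    score = victim.predict_proba([cand])[0, true_class]
    if score < best_score:                           # keep the square only if it helps
        x_adv, best_score = cand, score
    if victim.predict([x_adv])[0] != true_class:
        print(f"misclassified after {query + 1} queries")
        break

print("final prediction:", victim.predict([x_adv])[0], " true class:", true_class)
```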

HopSkipJump Attack

This black box attack was also proposed as a query-efficient approach, and it relies only on access to the predicted output class for any input. In other words, unlike the Square Attack, the HopSkipJump attack needs only the model’s class prediction; it does not require the ability to compute gradients or access to score values.

 

In contrast to other black box attack methods, this attack has the advantage of not being hindered by obstacles such as masked gradients, stochastic gradients, or non-differentiability.

All known decision-based algorithms, including HopSkipJumpAttack, share the limitation that the target model must be queried near the decision boundary. They can therefore be thwarted by defenses that restrict queries near the boundary or that broaden the decision boundary, for example by adding an “unknown” class for inputs with low confidence.
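To illustrate only the first ingredient of HopSkipJumpAttack, the binary search toward the decision boundary using nothing but hard-label queries, here is a small sketch against the same scikit-learn digits classifier (the full attack also estimates a gradient direction at the boundary and steps along it, which is omitted here).

```python
# Sketch of HopSkipJump's boundary binary search using only hard labels.
# The full attack also estimates gradients at the boundary from random
# perturbations and takes geometric steps, which is omitted here.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
victim = LogisticRegression(max_iter=5000).fit(X, y)
predict = lambda v: victim.predict([v])[0]          # hard-label access only

x_orig = X[0]                                       # the image we want to attack
orig_label = predict(x_orig)
x_start = X[np.flatnonzero(y != orig_label)[0]]     # any input with a different label

# Binary search on the segment between x_start and x_orig: find a point that is
# still misclassified but as close to the original image as possible.
lo, hi = 0.0, 1.0                                   # fraction of the way back toward x_orig
for _ in range(30):
    mid = (lo + hi) / 2
    candidate = (1 - mid) * x_start + mid * x_orig
    if predict(candidate) != orig_label:
        lo = mid                                    # still adversarial: move closer to x_orig
    else:
        hi = mid                                    # crossed the boundary: back off

boundary_point = (1 - lo) * x_start + lo * x_orig
print("label at boundary point:", predict(boundary_point))
print("distance to original:   ", np.linalg.norm(boundary_point - x_orig))
```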

White Box Attacks

White box attacks assume that the adversary has access to the model’s parameters and can obtain outputs for any input they provide. These attacks are more targeted than typical black box attacks, where the attacker is often just probing whether the data corruption affects the model at all.

Let us look at some White Box Attacks:

Fast Gradient Sign Method

Google researchers Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy developed one of the first attacks for producing adversarial examples, known as the fast gradient sign method (FGSM). It adds a small, nearly imperceptible amount of noise to an image so that the model misidentifies it. The noise is computed by multiplying a small constant epsilon by the sign of the gradient of the loss with respect to the image we wish to perturb.


In its most basic form, FGSM adds noise (not random noise) whose direction matches the gradient of the cost function with respect to the input data; that is, x_adv = x + ε · sign(∇x J(θ, x, y)), where J is the cost function, θ the model parameters, x the input, and y the label.

One huge advantage of FGSM is its comparatively low computation time; the disadvantage is that the perturbation is added to every single feature (pixel) of the image.
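Here is a minimal FGSM sketch in PyTorch. The model is a tiny, randomly initialized network and the input is random noise (assumptions made purely so the snippet runs standalone), so the predicted label may or may not actually flip; the point is the mechanics of the single gradient-sign step.

```python
# Minimal FGSM sketch: x_adv = x + epsilon * sign( dJ/dx ).
# The network is tiny and randomly initialized and the "image" is random
# noise, just to show the mechanics; with a real trained classifier the
# same few lines apply unchanged.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28, requires_grad=True)     # input with pixels in [0, 1]
y = model(x).argmax(dim=1)                           # treat the current prediction as the label

loss = loss_fn(model(x), y)
loss.backward()                                      # gradient of the loss w.r.t. the input pixels

epsilon = 0.1
x_adv = torch.clamp(x + epsilon * x.grad.sign(), 0, 1).detach()

print("original prediction:   ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```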

Carlini and Wagner

The Carlini and Wagner (C&W) technique builds on the earlier L-BFGS attack, which frames adversarial example generation as an optimization problem, but it drops the box constraints and uses different objective functions. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method is a nonlinear, gradient-based numerical optimization method used to minimize the perturbation applied to the image; however, L-BFGS is a time-consuming and inefficient procedure.


These changes make the method more efficient at creating adversarial samples, and it has been demonstrated to defeat state-of-the-art defenses such as defensive distillation and adversarial training.

This method is one of the strongest attacks known against defended models. However, its big disadvantage is its computational cost compared with techniques such as FGSM, JSMA, and DeepFool.
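Below is a heavily simplified, untargeted L2 C&W-style sketch in PyTorch. A single fixed trade-off constant c is used instead of the paper’s binary search, and the model and input are toy, randomly initialized stand-ins, so whether the label flips depends on the seed; the point is the tanh change of variables and the margin-style objective.

```python
# Heavily simplified untargeted L2 Carlini-Wagner-style attack.
# The real C&W attack binary-searches the trade-off constant c and runs many
# more iterations; the model, input, and constants here are toy assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))

x = torch.rand(1, 1, 28, 28)                       # "image" with pixels in [0, 1]
y = model(x).argmax(dim=1)                         # treat the current prediction as the true label

# Change of variables: x_adv = 0.5 * (tanh(w) + 1) keeps pixels inside [0, 1]
# without explicit box constraints.
w = torch.atanh((2 * x - 1).clamp(-0.999, 0.999)).requires_grad_(True)
optimizer = torch.optim.Adam([w], lr=0.01)
c, kappa = 1.0, 0.5                                # trade-off constant, confidence margin

for step in range(300):
    x_adv = 0.5 * (torch.tanh(w) + 1)
    logits = model(x_adv)
    true_logit = logits[0, y.item()]
    best_other = logits[0, torch.arange(10) != y.item()].max()

    # Minimize perturbation size plus c * margin loss: push the true-class logit
    # at least kappa below the best competing logit.
    margin = torch.clamp(true_logit - best_other, min=-kappa)
    loss = torch.sum((x_adv - x) ** 2) + c * margin

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

x_adv = (0.5 * (torch.tanh(w) + 1)).detach()
print("original prediction:   ", y.item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
print("L2 perturbation norm:  ", round(torch.norm(x_adv - x).item(), 4))
```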

Protecting Machine Learning Systems Against Adversarial Attacks

Adversarial training can successfully defend models in specific settings; this defense strategy augments a supervised model’s training data with adversarial examples, helping the model learn to handle them. By training on both clean and adversarial data, this method attempts to reduce the risk posed by adversarial examples.
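A compact sketch of the idea in PyTorch, using toy synthetic data and FGSM to generate the adversarial half of each batch (FGSM is one common choice here, not the only one):

```python
# Minimal adversarial training sketch: each step trains on a clean batch and
# on an FGSM-perturbed copy of it. Data and model are toy stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
epsilon = 0.1

# Toy dataset: two Gaussian blobs.
X = torch.cat([torch.randn(500, 20) + 1.0, torch.randn(500, 20) - 1.0])
y = torch.cat([torch.zeros(500, dtype=torch.long), torch.ones(500, dtype=torch.long)])

def fgsm(x, y):
    """Craft FGSM adversarial examples against the current model."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + epsilon * x.grad.sign()).detach()

for epoch in range(20):
    for i in range(0, len(X), 64):
        xb, yb = X[i:i + 64], y[i:i + 64]
        x_adv = fgsm(xb, yb)                       # adversarial copies of the batch
        loss = loss_fn(model(xb), yb) + loss_fn(model(x_adv), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluate on clean inputs and on freshly crafted FGSM inputs.
with torch.no_grad():
    clean_acc = (model(X).argmax(1) == y).float().mean().item()
adv_acc = (model(fgsm(X, y)).argmax(1) == y).float().mean().item()
print(f"clean accuracy: {clean_acc:.3f}  adversarial accuracy: {adv_acc:.3f}")
```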

Training a model to withstand adversarial attacks can be tedious, but the MLOps team can take several steps within the machine learning pipeline; let us look at some of them:

  1. Threat modeling – Formalize the attacker’s goals and capabilities with respect to the target system.

  2. Attack simulation – Formalize the optimization problem the attacker tries to solve according to possible attack strategies.

  3. Attack impact evaluation

  4. Countermeasure design

  5. Noise detection (for evasion-based attacks)

  6. Information laundering – Alter the information received by adversaries (for model stealing attacks)

Conclusion

Adversarial machine learning is a new and growing research field that presents many complex problems across AI and ML. Are we in danger of adversaries exploiting our machine learning models with adversarial attacks? Right now it is difficult to say for certain, and, most importantly, there are no silver bullets for defending models against adversarial attacks. Many techniques and strategies are being explored across machine learning and AI, and the future will likely bring better ways to protect against adversarial attacks.

References

[img-2] – https://arxiv.org/pdf/1412.6572.pdf

[img-3] – https://medium.com/cltc-bulletin/adversarial-machine-learning-43b6de6aafdb

[img-4] – https://www.thewolfofallstreets.io/bitcoin-and-the-byzantine-generals-problem/

[img-5] – https://hackernoon.com/adversarial-machine-learning-a-beginners-guide-to-adversarial-attacks-and-defenses

[img-6] – https://www.researchgate.net/figure/Diagram-of-ML-model-extraction-attacks-A-data-owner_fig2_308027534

[img-7] – https://www.davidwong.fr/blockbreakers/square_2_attack4rounds.html

[img-8] – https://www.researchgate.net/figure/Intuitive-explanation-of-HopSkipJumpAttack-a-Perform-a-binary-search-to-find-the_fig1_343339153

[img-9] – https://pyimagesearch.com/2021/03/01/adversarial-attacks-with-fgsm-fast-gradient-sign-method/

[img-10] – https://www.skillsire.com/read-blog/359_a-overview-on-adversarial-attacks-and-defenses.html?mode=night
