Support Vector Machine (SVM) Algorithm

Anshul | Last Updated: 04 Feb, 2025
18 min read

SVM (Support Vector Machine) is a powerful supervised algorithm, effective for both regression and classification, though it excels in classification tasks. Popular since the 1990s, it performs well on smaller or complex datasets with minimal tuning. Before diving into SVM, ensure you’re familiar with Decision Trees, Random Forest, Naïve Bayes, K-nearest neighbor, and Ensemble Modeling. This article will explain what SVM is, how it works, and the math behind this essential ML algorithm, building on the SVM concepts previously discussed.

This article was published as a part of the Data Science Blogathon!

What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression. It finds the best line (or hyperplane) to separate data into groups, maximizing the distance between the closest points (support vectors) of each group. It can handle complex data using kernels to transform it into higher dimensions. In short, SVM helps classify data effectively.
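
As a quick, hedged sketch of this idea in code (the toy data and parameters below are illustrative assumptions, not from this article), scikit-learn's SVC exposes exactly these pieces: you fit a classifier, it finds the maximum-margin boundary, and the support vectors are available afterwards.

# A minimal sketch: fit a linear SVM on illustrative two-class data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # two separable groups
clf = SVC(kernel="linear")        # linear kernel: find the maximum-margin line
clf.fit(X, y)
print(clf.support_vectors_)       # the closest points that define the margin
print(clf.predict([[0.0, 0.0]]))  # classify a new point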

Types of Support Vector Machine (SVM) Algorithms

  • Linear SVM: Used when the data is perfectly linearly separable, i.e., the data points can be separated into 2 classes by a single straight line (in 2D).
  • Non-Linear SVM: Used when the data is not linearly separable, i.e., the data points cannot be separated into 2 classes by a straight line (in 2D). In that case we use advanced techniques like the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use the kernel trick.

Logistic Regression VS Support Vector Machine (SVM)

Logistic Regression

  1. Probabilistic Approach:
    Logistic Regression predicts the probability that an input belongs to a specific class (e.g., 80% chance of being “spam”). It uses the sigmoid function to map inputs to probabilities between 0 and 1.
  2. Linear Decision Boundary:
    It assumes the data can be separated by a straight line (or a hyperplane in higher dimensions). If the data isn’t linearly separable, Logistic Regression may struggle.
  3. Simple and Interpretable:
    It’s easy to implement and interpret, making it a good starting point for classification problems. The coefficients of the model can also tell you how each feature influences the outcome.
  4. Works Well for Linearly Separable Data:
    Logistic Regression performs well when the relationship between the input features and the output is linear.
  5. Efficient for Large Datasets:
    It’s computationally efficient and scales well to large datasets, making it a popular choice for many real-world applications.

Support Vector Machine (SVM)

  1. Maximizes the Margin:
    SVM focuses on finding the decision boundary that maximizes the margin (the distance between the boundary and the closest data points of each class). This makes it more robust to new data.
  2. Handles Non-Linear Data:
    SVM can handle non-linear data using the “kernel trick,” which transforms the data into a higher-dimensional space where it becomes easier to separate.
  3. Effective in High Dimensions:
    SVM works well even when the number of features (dimensions) is much larger than the number of samples, making it suitable for complex datasets.
  4. Robust to Overfitting:
    By focusing on the points closest to the boundary (support vectors), SVM is less likely to overfit, especially in smaller datasets.
  5. Requires Tuning:
    SVM requires careful tuning of parameters (like the choice of kernel and regularization) to achieve optimal performance, which can be time-consuming.
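
To see these differences in practice, here is a hedged side-by-side sketch; the synthetic dataset and settings are illustrative assumptions, not part of the original article.

# Compare Logistic Regression and SVM on the same illustrative data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
svm_clf = SVC(kernel="rbf").fit(X_train, y_train)
print("Logistic Regression accuracy:", log_reg.score(X_test, y_test))
print("SVM (RBF kernel) accuracy:", svm_clf.score(X_test, y_test))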

How Does Support Vector Machine Algorithm Work?

SVM is defined in terms of the support vectors only: the margin is built from the points closest to the hyperplane, so we don't have to worry about the other observations. In logistic regression, by contrast, the classifier is defined over all the points. Hence SVM enjoys some natural speed-ups.

Let’s understand the working of SVM using an example. Suppose we have a dataset that has two classes (green and blue), and we want to classify a new data point as either blue or green.

Data Points

To classify these points, we can have many decision boundaries, but the question is which is the best and how do we find it?

NOTE: Since we are plotting the data points on a 2-dimensional graph, we call this decision boundary a straight line, but if we have more dimensions, we call it a “hyperplane”.

Adding decision boundaries to separate the blue and green data points

The best hyperplane is the one that has the maximum distance from both classes, and finding it is the main aim of SVM. This is done by considering the different hyperplanes that classify the labels correctly and then choosing the one which is farthest from the data points, i.e., the one with the maximum margin.

Checking the margin | Support Vector Machine(SVM)

Advantages of Support Vector Machine

  1. Works well with complex data: SVM is great for datasets where the separation between categories is not clear. It can handle both linear and non-linear data effectively.
  2. Effective in high-dimensional spaces: SVM performs well even when there are more features (dimensions) than samples, making it useful for tasks like text classification or image recognition.
  3. Avoids overfitting: SVM focuses on finding the best decision boundary (margin) between classes, which helps in reducing the risk of overfitting, especially in high-dimensional data.
  4. Versatile with kernels: By using different kernel functions (like linear, polynomial, or radial basis function), SVM can adapt to various types of data and solve complex problems.
  5. Robust to outliers: SVM is less affected by outliers because it focuses on the support vectors (data points closest to the margin), which helps in creating a more generalized model.

Disadvantages of Support Vector Machine

  1. Slow with large datasets: SVM can be computationally expensive and slow to train, especially when the dataset is very large.
  2. Difficult to tune: Choosing the right kernel and parameters (like C and gamma) can be tricky and often requires a lot of trial and error.
  3. Not suitable for noisy data: If the dataset has too many overlapping classes or noise, SVM may struggle to perform well because it tries to find a perfect separation.
  4. Hard to interpret: Unlike some other algorithms, SVM models are not easy to interpret or explain, especially when using non-linear kernels.
  5. Memory-intensive: SVM requires storing the support vectors, which can take up a lot of memory, making it less efficient for very large datasets.

Mathematical Intuition: Support Vector Machine (SVM)

Many people skip the math behind SVMs because it’s complex. Here, we’ll focus on the key steps needed to implement the algorithm, avoiding deep theoretical concepts like primal/dual formulations or Lagrange multipliers, which are more relevant for research. Let’s dive into the practical workings of SVMs.

Before getting into the nitty-gritty details of this topic first let’s understand what a dot product is.

Understanding Dot-Product

We all know that a vector is a quantity that has magnitude as well as direction, and just like numbers we can apply mathematical operations to vectors, such as addition and multiplication. In this section, we will learn about the multiplication of vectors, which can be done in two ways: the dot product and the cross product. The only difference is that the dot product gives a scalar value as the result, whereas the cross product gives a vector.

The dot product can be defined as the projection of one vector onto another, multiplied by the magnitude of the other vector.

Dot product of two vectors

Here A and B are two vectors. To find the dot product between these two vectors, we first find the magnitude of both vectors; to find the magnitude we use the Pythagorean theorem or the distance formula.

After finding the magnitude we simply multiply it with the cosine angle between both the vectors. Mathematically it can be written as:

A . B = |A| cosθ * |B|

  • Where |A| cosθ is the projection of A on B
  • And |B| is the magnitude of vector B

Now, in SVM we just need the projection of A, not the magnitude of B (I’ll tell you why later). To get just the projection, we can simply take the unit vector of B, because it is in the direction of B but its magnitude is 1. Hence the equation becomes:

A.B = |A| cosθ * unit vector of B
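
To make this concrete, here is a small NumPy sketch (the vectors are arbitrary examples, not taken from the article's figures) that computes the projection of A onto B using the unit vector of B:

import numpy as np
A = np.array([3.0, 4.0])
B = np.array([6.0, 0.0])
B_unit = B / np.linalg.norm(B)   # unit vector in the direction of B
projection = np.dot(A, B_unit)   # |A| * cos(theta): the length of A's shadow on B
print(projection)                # 3.0 for these example vectors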

Now let’s move to the next part and see how we will use this in SVM.

Use of Dot Product in SVM

Consider a random point X and we want to know whether it lies on the right side of the plane or the left side of the plane (positive or negative).

Use of Dot Product in Support Vector Machine

To find this, we first assume this point is a vector (X) and then we make a vector (w) which is perpendicular to the hyperplane. Let’s say the distance from the origin to the decision boundary, measured along w, is ‘c’. Now we take the projection of the vector X onto w.

Use of Dot Product in Support Vector Machine

We already know that the projection of one vector onto another is given by the dot product. Hence, we take the dot product of the x and w vectors. If the dot product is greater than ‘c’ then we can say that the point lies on the right side. If the dot product is less than ‘c’ then the point is on the left side, and if the dot product is equal to ‘c’ then the point lies on the decision boundary.
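
A tiny sketch of this rule (w, c and the test points below are made-up values for illustration):

import numpy as np
w = np.array([1.0, 1.0]) / np.sqrt(2)   # unit vector perpendicular to the boundary
c = 2.0                                  # distance of the boundary from the origin along w
def side_of_boundary(x):
    # Compare the projection of x onto w with c to decide which side x lies on.
    projection = np.dot(x, w)
    if projection > c:
        return "right (positive) side"
    if projection < c:
        return "left (negative) side"
    return "on the decision boundary"
print(side_of_boundary(np.array([3.0, 3.0])))  # projection ~4.24 > 2
print(side_of_boundary(np.array([1.0, 0.0])))  # projection ~0.71 < 2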

Formula for dot product

Reason for Perpendicular Vector (w) in SVM:

  • A common doubt is why we choose the perpendicular vector w to the hyperplane.
  • The goal is to measure the distance of a vector X from the decision boundary.
  • Since there are infinite points on the boundary, measuring distance from all of them is impractical.
  • To standardize, we use the perpendicular vector w as a reference.
  • We project all other data points onto this perpendicular vector and compare their distances.

In SVM we also have the concept of margin. In the next section, we will see how we find the equation of a hyperplane and what exactly we need to optimize in SVM.

Margin in Support Vector Machine

We all know the equation of a hyperplane is w.x+b=0, where w is a vector normal to the hyperplane and b is an offset.

Margin in Support Vector Machine(SVM)

To classify a point as negative or positive we need to define a decision rule. We can define the decision rule as:

Decision Rule in SVM

If the value of w.x+b>0 then we can say it is a positive point otherwise it is a negative point. Now we need (w,b) such that the margin has a maximum distance. Let’s say this distance is ‘d’.

Maximum Margin | Support Vector Machine

To calculate ‘d’ we need the equations of L1 and L2. For this, we will make a few assumptions: the equation of L1 is w.x+b=1 and that of L2 is w.x+b=-1.

Now the question comes

  • Why are the magnitudes equal? Why didn’t we take 1 and -2?
  • Why did we only take 1 and -1, why not any other value like 24 and -100?
  • Why did we assume this line?

Let’s try to answer these questions

  • We want our plane to have equal distance from both the classes, which means L should pass through the center of L1 and L2; that’s why we take the magnitudes to be equal.
  • Let’s say the equation of our hyperplane is 2x+y=2. We observe that even if we multiply the whole equation by some other number, the line doesn’t change (try plotting it on a graph). Hence, for mathematical convenience, we take it as 1.
  • Now the main question: why exactly do we need to assume only this line? To answer this, I’ll take the help of graphs.

Suppose the equation of our hyperplane is 2x+y=2:

Plot for equation: 2x+y=2

Let’s create margin for this hyperplane,

Adding hyperplanes

If we multiply these equations by 10, we will see that the parallel lines (red and green) get closer to our hyperplane. For more clarity, look at this graph.

We also observe that if we divide the equation by 10, these parallel lines move farther away from the hyperplane. Look at this graph.

By this I wanted to show you that the parallel lines depend on the (w, b) of our hyperplane: if we multiply the equation of the hyperplane by a factor greater than 1, the parallel lines shrink towards it, and if we multiply by a factor less than 1, they expand away from it.
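
We can check this numerically. The distance from the hyperplane w.x + b = 0 to the parallel line w.x + b = 1 is 1/||w||, so scaling (w, b) by a factor k changes that distance to 1/(k·||w||). A quick sketch using the article's example 2x + y = 2 (i.e. w = (2, 1), b = -2):

import numpy as np
def gap_to_margin_line(w):
    # Distance from the hyperplane w.x + b = 0 to the parallel line w.x + b = 1.
    return 1.0 / np.linalg.norm(w)
w = np.array([2.0, 1.0])                # hyperplane 2x + y - 2 = 0
print(gap_to_margin_line(w))            # ~0.447
print(gap_to_margin_line(10 * w))       # ~0.045: multiplying (w, b) by 10 shrinks the gap
print(gap_to_margin_line(0.1 * w))      # ~4.47: dividing by 10 widens it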

We can now say that these lines will move as we do changes in (w,b) and this is how this gets optimized. But what is the optimization function? Let’s calculate it.

We know that the aim of SVM is to maximize this margin that means distance (d). But there are few constraints for this distance (d). Let’s look at what these constraints are.

Optimization Function and its Constraints

In order to get our optimization function, there are a few constraints to consider. The constraint is that “we’ll calculate the distance (d) in such a way that no positive or negative point can cross the margin line”. Let’s write these constraints mathematically:

Representing the constraints mathematically

Rather than taking 2 constraints forward, we’ll now try to simplify these two constraints into 1. We assume that negative classes have y=-1 and positive classes have y=1.

We can say that for every point to be correctly classified this condition should always be true:

Condition | Support Vector Machine

Suppose a green point is correctly classified; that means it follows w.x+b>=1. If we multiply this by y=1 we get the same equation mentioned above. Similarly, if we do this for a red point with y=-1 we again get this equation. Hence, we can say that we need to maximize (d) such that this constraint holds true.

  • We will select 2 support vectors:
    • 1 from the negative class.
    • 1 from the positive class.
  • The distance between these two support vectors, x1 and x2, will be represented as the vector (x2 − x1).
  • Our goal is to find the shortest distance between these two points.
  • This can be achieved using a trick from the dot product:
    • Take a vector w that is perpendicular to the hyperplane.
    • Find the projection of the vector (x2 − x1) onto w.

Note: this perpendicular vector should be a unit vector; only then will this work. Why should it be a unit vector? This was explained in the dot-product section. To make ‘w’ a unit vector we divide it by the norm of ‘w’.

Adding vector w | Support Vector Machine (SVM)

Finding Projection of a Vector on Another Vector Using Dot Product

We already know how to find the projection of a vector onto another vector: we take the dot product of the two vectors. So let’s see how:

Dot product of two vectors: (x2-x1) and w

Since x2 and x1 are support vectors, they lie on the margin lines L1 and L2 and hence satisfy yi(w.xi+b)=1, so we can write it as:

Equation 2 and 3

Putting equations (2) and (3) in equation (1) we get:

Putting equations (2) and (3) in equation (1)

Hence the equation which we have to maximize is:

Equation which we need to maximize
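
Since the equations referenced above appear as images, here is the same derivation written out in the article’s notation (a reconstruction of the standard steps, under the assumption that L1 is w.x+b=1 and L2 is w.x+b=-1):

d = (x2 − x1) . w / ||w||                         ... (1)
w . x2 + b = 1   ⇒   w . x2 = 1 − b               ... (2)
w . x1 + b = −1  ⇒   w . x1 = −1 − b              ... (3)
Putting (2) and (3) into (1):  d = ((1 − b) − (−1 − b)) / ||w|| = 2 / ||w||

So the quantity SVM tries to maximize is 2/||w||, subject to yi(w.xi+b) >= 1 for every training point.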

We have now found our optimization function, but there is a catch: we rarely find this type of perfectly linearly separable data in industry, so the condition we proved here hardly ever applies directly. The type of problem we just studied is called Hard Margin SVM; next we shall study the Soft Margin SVM, which is similar but uses a few more interesting tricks.

Soft Margin SVM

  • Real-world datasets are rarely perfectly linearly separable; they are often nearly or completely non-linearly separable.
  • Methods for linearly separable data don’t work for these cases.
  • Support Vector Machines (SVM) are a powerful machine learning tool.
  • SVMs can handle both nearly and non-linearly separable datasets.
  • They provide effective solutions for classification problems in diverse real-world scenarios.

To tackle this problem, we modify the equation in such a way that it allows a few misclassifications, that is, it allows a few points to be wrongly classified.

We know that max[f(x)] can also be written as min[1/f(x)], and it is common practice to minimize a cost function in optimization problems; therefore, we can invert the function.

Soft margin Support Vector Machine equation

To make the soft margin equation, we add a penalty term to this equation: the slack variable zeta for each point, summed over all points and multiplied by a hyperparameter ‘c’.

Soft margin Support Vector Machine equation

For all the correctly classified points zeta will be equal to 0, and for each incorrectly classified point zeta is simply the distance of that point from its correct margin line. That means for the wrongly classified green points the value of zeta is their distance from the L1 hyperplane, and for a wrongly classified red point zeta is its distance from the L2 hyperplane.
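
For reference (since the equation above is shown as an image), the soft margin objective this section describes is usually written as follows; the exact form may differ slightly from the article’s image:

Minimize:   ||w|| / 2  +  c * (ζ1 + ζ2 + ... + ζn)
Subject to: yi (w . xi + b) >= 1 − ζi   and   ζi >= 0   for every training point i

Textbooks often write the first term as ||w||² / 2; both forms lead to the same trade-off between a wide margin and few misclassifications.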

Zeta in Soft margin SVM

So now we can say that SVM Error = Margin Error + Classification Error. The wider the margin, the lower the margin error, and vice versa.

Let’s say you take a high value of ‘c’, e.g. c = 1000. This would mean that you don’t want to focus on margin error and just want a model which doesn’t misclassify any data point.

Look at the figure below:

Two SVM models

If someone asks you which is the better model, the one where the margin is maximum but has 2 misclassified points, or the one where the margin is very small and all the points are correctly classified?

Well, there’s no single correct answer to this question, but we can use SVM Error = Margin Error + Classification Error to reason about it. If you don’t want any misclassification in the model, you can choose figure 2; that means we increase ‘c’ to decrease the classification error. But if you want your margin to be maximized, the value of ‘c’ should be decreased. That’s why ‘c’ is a hyperparameter, and we find its optimal value using GridSearchCV and cross-validation.
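
As a hedged illustration of this trade-off (the synthetic data below is an assumption for demonstration, not the article's dataset), you can fit the same classifier with a small and a large ‘c’ and count the training misclassifications:

# Illustrate the effect of the hyperparameter C on a soft margin SVM.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=7)
for c in (0.01, 1000):
    clf = SVC(kernel="linear", C=c).fit(X, y)
    errors = (clf.predict(X) != y).sum()
    print("C =", c, "| misclassified training points:", errors,
          "| support vectors:", len(clf.support_vectors_))
# A small C tolerates more misclassifications (wider margin); a large C tries
# hard to classify every training point correctly (narrower margin).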

Kernels in Support Vector Machine

The most interesting feature of SVM is that it can even work with a non-linear dataset; for this we use the “kernel trick”, which makes it easier to classify the points. Suppose we have a dataset like this:

dataset, Kernels in Support Vector Machine (SVM)

Here we see that we cannot draw a single line, or hyperplane, which can classify the points correctly. So what we do is convert this lower-dimensional space into a higher-dimensional space using some transformation functions (such as quadratic functions), which allows us to find a decision boundary that clearly divides the data points. The functions which help us do this are called kernels, and which kernel to use is determined by hyperparameter tuning.

Kernel and Decision Surface

Different Kernel Functions

Some kernel functions which you can use in SVM are given below:

1. Polynomial Kernel

Following is the formula for the polynomial kernel:

Formula for the polynomial kernel
  • Here d is the degree of the polynomial, which we need to specify manually.
  • Suppose we have two features X1 and X2 and the output variable Y; using the polynomial kernel we can write it as:

Formula for the polynomial kernel

So we basically need to compute X1², X2² and X1·X2, and now we can see that 2 dimensions got converted into 5 dimensions.
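
As a small sketch of what this means (using the simple form (X1·X2)² on arbitrary example vectors; the article’s formula may also include a constant term, which works the same way with one extra feature), we can check that the kernel value equals a dot product in the expanded space built from X1², X2² and X1·X2:

import numpy as np
def poly_kernel(a, b, degree=2):
    # Simple polynomial kernel: (a . b) ** degree
    return np.dot(a, b) ** degree
def explicit_map(x):
    # Explicit degree-2 feature map for 2 features: x1^2, x2^2 and the cross term
    # (the sqrt(2) factor makes the dot product match the kernel exactly).
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])
a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(a, b))                        # kernel computed directly: 16.0
print(np.dot(explicit_map(a), explicit_map(b))) # same value via the expanded features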

An SVM using a polynomial kernel

2. Sigmoid Kernel

We can use it as a proxy for neural networks. The equation is:

Equation for the sigmoid kernel

It simply takes your inputs and maps them to values between 0 and 1 so that they can be separated by a simple straight line.

Support Vector Classifier using the sigmoid kernel

Image Source: https://dataaspirant.com/svm-kernels/#t-1608054630725

3. RBF Kernel

What it actually does is create non-linear combinations of our features to lift the samples onto a higher-dimensional feature space, where we can use a linear decision boundary to separate the classes. It is the most used kernel in SVM classification; the following formula describes it mathematically:

Formula for the RBF kernel

where,

  • ‘σ’ is the variance and our hyperparameter
  • ||X₁ – X₂|| is the Euclidean distance between two points X₁ and X₂

Support Vector Classifier using the RBF kernel
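
A quick NumPy sketch of this formula (the points and σ below are arbitrary examples; note that scikit-learn parameterizes the same kernel with gamma instead of σ):

import numpy as np
def rbf_kernel(x1, x2, sigma=1.0):
    # RBF kernel: exp(-||x1 - x2||^2 / (2 * sigma^2))
    squared_distance = np.sum((x1 - x2) ** 2)
    return np.exp(-squared_distance / (2 * sigma ** 2))
a, b = np.array([1.0, 2.0]), np.array([2.0, 3.0])
print(rbf_kernel(a, a))   # 1.0: identical points have maximum similarity
print(rbf_kernel(a, b))   # ~0.37 with sigma = 1: similarity decays with distance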

4. Bessel function kernel

It is mainly used for eliminating the cross term in mathematical functions. Following is the formula of the Bessel function kernel:

formula of the Bessel function kernel

5. Anova Kernel

It performs well on multidimensional regression problems. The formula for this kernel function is:

Formula for Anova Kernel

How to Choose the Right Kernel? 

You might be wondering how to decide which kernel function will work efficiently for your dataset. It is important to choose a good kernel function because the performance of the model depends on it.

Here are some points to help you choose the right kernel:

  • Kernel selection depends on the dataset type.
  • For linearly separable data, use a linear kernel:
    • It is simple and has lower complexity compared to other kernels.
    • Start by assuming your data is linearly separable and try the linear kernel first.
  • Move to more complex kernels if needed.
  • Commonly used kernels:
    • Linear and RBF (Radial Basis Function) are widely used.
    • Polynomial kernels are rarely used due to poor efficiency.
  • If both linear and RBF kernels give similar results:
    • Choose the simpler option, which is the linear kernel.
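
A simple, hedged way to apply this advice in code (with illustrative synthetic data standing in for your own X and y) is to cross-validate both kernels and keep the linear one if the scores are close:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
X, y = make_classification(n_samples=400, n_features=8, random_state=1)  # stand-in data
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, "mean CV accuracy:", scores.mean())
# If both kernels score about the same, prefer the simpler linear kernel.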

Example

Let’s understand this with the help of an example. For simplicity I’ll take only 2 features, which means 2 dimensions. In the figure below I have plotted the decision boundary of a linear SVM on 2 features of the iris dataset:

Support Vector Machine (SVM) on Iris dataset

Here we see that a linear kernel works fine on this dataset, but now let’s see how the RBF kernel will work.

RBF kernel

We can observe that both kernels give similar results; both work well with our dataset, but which one should we choose? Linear SVM is a parametric model. A parametric model is one in which everything learned from the data is captured in its parameters; in short, the parameters are the only information needed to make predictions for new values.

Here are some points you should go through:

  • The complexity of the RBF kernel increases with the size of the training data.
  • Preparing the RBF kernel is computationally expensive.
  • The kernel matrix must be stored and maintained, which requires additional memory.
  • Projection into the “infinite” higher-dimensional space (where data becomes linearly separable) is costly, especially during prediction.
  • Using a linear kernel on a non-linear dataset results in very low accuracy and is not suitable.

Using a linear kernel for a Support Vector Machine (SVM)

So for this kind of dataset, we can use RBF without a second thought because it makes a decision boundary like this:

Using RBF kernel in Support Vector Machine(SVM)

Implementation and Hyperparameter Tuning of Support Vector Machine in Python

For implementation on a dataset, we will be using the Income Evaluation dataset, which has information about an individual’s personal life and an income label of >50K or <=50K. The dataset can be found here.

The task here is to classify the income of an individual when given the required inputs about his personal life.

First, let’s import all required libraries.

# Import all relevant libraries
from sklearn.svm import SVC
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")

Now let’s read the dataset and look at the columns to understand the information better.

df = pd.read_csv('income_evaluation.csv')

df.head()

Income evaluation dataframe

I have already done the data preprocessing part, and you can look at the whole code here. My main aim here is to show you how to implement SVM in Python.

Now, for training and testing our model, the data has to be divided into train and test sets.
We will also standardize the features so they are on a comparable scale.

# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(df.drop('income', axis=1), df['income'], test_size=0.2)

# Standardize the features (assuming the preprocessing above produced numeric columns)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now let’s go ahead with defining the Support Vector Classifier along with its hyperparameters. Next, we will fit this model on the training data and estimate its accuracy with cross-validation.

# Define support vector classifier with hyperparameters
svc = SVC(random_state=101)
svc.fit(X_train, y_train)

accuracies = cross_val_score(svc, X_train, y_train, cv=5)
print("Train Score:", np.mean(accuracies))
print("Test Score:", svc.score(X_test, y_test))

The model has been trained and we can now observe the outputs as well.

Below, you can see the accuracy on the train and test datasets.

You can also tune your model’s hyperparameters with the following code:

grid = {
    'C': [0.01, 0.1, 1, 10],
    'kernel': ["linear", "poly", "rbf", "sigmoid"],
    'degree': [1, 3, 5, 7],
    'gamma': [0.01, 1]
}

svm = SVC()
svm_cv = GridSearchCV(svm, grid, cv=5)
svm_cv.fit(X_train, y_train)

print("Best Parameters:", svm_cv.best_params_)
print("Train Score:", svm_cv.best_score_)
print("Test Score:", svm_cv.score(X_test, y_test))

The dataset is pretty big, so the grid search will take a long time to train; for this reason I haven’t pasted its results here. SVM doesn’t scale well to big datasets and takes a long time to train on them.

SVM in Machine Learning

Support Vector Machine (SVM) is a popular and reliable supervised machine learning technique that was originally created for classification tasks, though it can also be adapted to solve regression problems. The goal of SVM is to locate the optimal separating hyperplane between classes in the feature space. Check out these key concepts:

  1. Hyperplane:
    • Acts as the decision boundary in the feature space, separating different classes.
    • In 2D, it appears as a line; in higher dimensions, it becomes a flat affine subspace.
  2. Margin:
    • Represents the distance between the hyperplane and the closest data points of any class.
    • SVM aims to maximize this margin to ensure the widest possible separation between classes.
    • The goal is to create the largest possible “street” between classes without misclassification.
  3. Support Vectors:
    • These are the data points closest to the hyperplane.
    • They are critical in determining the hyperplane’s position and orientation.
    • Support vectors directly influence the optimal hyperplane.

Conclusion

In this article, we looked at a very powerful machine learning algorithm, the Support Vector Machine, in detail. I discussed how it works, the math intuition behind SVM, its implementation in Python, the kernel tricks used to classify non-linear datasets, and its pros and cons, and finally we solved a problem with the help of SVM. You should also have gained some insight into where SVM fits in machine learning.

Frequently Asked Questions

Q1. What is a SVM algorithm?

A. SVM algorithm is used for both classification and regression tasks. It finds an optimal hyperplane to separate data points of different classes in a high-dimensional space.

Q2. Why is SVM the best algorithm?

A. SVM is considered one of the best algorithms because it can handle high-dimensional data, is effective in cases with limited training samples, and can handle non-linear classification using kernel functions.

Q3. What are the steps of SVM algorithm?

A. The steps of the SVM algorithm involve: (1) selecting the appropriate kernel function, (2) defining the parameters and constraints, (3) solving the optimization problem to find the optimal hyperplane, and (4) making predictions based on the learned model.

Q4. What does SVM do in machine learning?

A. In machine learning, SVM is used to classify data by finding the optimal decision boundary that maximally separates different classes. It aims to find the best hyperplane that maximizes the margin between support vectors, enabling effective classification even in complex, non-linear scenarios.

