Picture this: you’re on a quest to find the perfect algorithm that can effortlessly distinguish between apples and oranges, even when they’re mixed together in a basket. Enter Support Vector Machines, or SVM for short, your trusty guide in the realm of machine learning. Soft margin SVM is like a savvy detective, armed with the power to draw clear lines between different classes of data points, enabling it to make accurate predictions with remarkable precision.
This article aims to provide a basic understanding of support vector machines (SVM), the optimization that happens behind the scenes, and the model's key parameters, along with an implementation in Python. Along the way, you will see how the support vector machine algorithm works and walk through practical SVM examples in the context of machine learning.
This article was published as a part of the Data Science Blogathon.
Support Vector Machine serves as a supervised learning algorithm applicable for both classification and regression problems, though it finds its primary use in classification tasks. Class labels are denoted as -1 for the negative class and +1 for the positive class in Support Vector Machine.
The main task in a classification problem is to find the best separating hyperplane, or decision boundary. In an n-dimensional feature space, a hyperplane is an (n-1)-dimensional subspace, and the resulting decision boundary can be linear or nonlinear. The data points that lie closest to the hyperplane and determine its position are called support vectors; they are simply feature values represented in vector form. Lagrange multipliers play a crucial role in optimizing the SVM objective function. Like logistic regression, SVM is used as a classifier, although it optimizes a very different objective.
From the above figure, we can see that hyperplane HP4 is the best, as it is able to correctly classify all the data points, including the support vectors. In the context of Support Vector Machines (SVM), margins refer to the separation between the decision boundary and the closest data points from each class.
Margins represent the width of the corridor that the SVM algorithm aims to maximize when finding the optimal hyperplane to separate different classes of data. The larger the margin, the greater the confidence in the classification made by the SVM model.
By maximizing the margin, soft margin SVM not only aims to correctly classify the training data but also seeks robustness against noise and outliers in the dataset. This margin maximization is a key principle behind SVM’s ability to generalize well to unseen data, making it a powerful tool in machine learning classification tasks.
Another point to note from the above figure is that the farther a data point lies from the decision boundary and its margins, the more confidently it is classified.
These are two variants of the Support Vector Machine algorithm, each suited for different types of data distributions and classification tasks.
In summary, linear SVM is appropriate for linearly separable data, while non-linear SVM is used for data with complex, non-linear relationships. The choice between the two depends on the nature of the dataset and the problem at hand.
For more background, read this Guide on Support Vector Machine.
The core of any machine learning algorithm is the optimization technique that happens behind the scenes.
Soft margin SVM maximizes the margin by learning a suitable decision boundary/decision surface/separating hyperplane.
The optimization technique used in Support Vector Machines (SVM) involves solving a convex optimization problem to find the optimal hyperplane that maximizes the margin between classes. This optimization problem aims to minimize the classification error while maximizing the margin, which is the distance between the decision boundary and the closest data points from each class.
Formally, the optimization problem in SVM can be expressed as:
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{N}\xi_i$$

subject to:

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i$$

$$\xi_i \ge 0, \quad i = 1, \dots, N$$

where:

$w$ and $b$ define the separating hyperplane, and $\xi_i$ are the slack variables that measure how much each point violates the margin,

$C$ is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error,

$N$ is the number of training examples.
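To make the objective concrete, here is a minimal NumPy sketch (not part of the original article) that evaluates the soft margin objective for a given weight vector and bias; the names w, b, X, y, and C are illustrative assumptions.
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # slack for each point: xi_i = max(0, 1 - y_i * (w . x_i + b))
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    # 0.5 * ||w||^2 plus C times the total slack
    return 0.5 * np.dot(w, w) + C * slack.sum()

# tiny toy example: one point per class in 2-D
X_toy = np.array([[2.0, 1.0], [-1.0, -2.0]])
y_toy = np.array([1.0, -1.0])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X_toy, y_toy, C=1.0))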
a. We can clearly see that SVM tries to maximize the margins, and is thus called a Maximum Margin Classifier.
b. The support vectors are the points whose decision-function value is exactly +1 or -1, i.e., they lie on the margins.
c. The more negative the values are for the green data points, the better it is for classification.
d. The more positive the values are for the red data points, the better it is for classification.
For a more in-depth look at the maths behind Support Vector Machines, refer to this article.
Choosing a correct classifier is really important. Let us understand this with an example.
Suppose we are given two hyperplanes: one with 100% accuracy (HP1) on the left side, and another with more than 90% accuracy (HP2) on the right side. Which one do you think is the correct classifier?
Most of us would pick HP2, thinking it must be better because of its larger margin. But that is the wrong answer.
Support Vector Machine would choose HP1, even though it has a narrower margin. Although HP2 has the larger margin, it violates the constraint that every data point must lie on the correct side of the margin with no misclassification. This is the hard constraint that Support Vector Machine follows throughout.
Support Vector Machine allows different kernel functions like linear, polynomial, sigmoid, and radial basis function (RBF). The choice of kernel depends on the data and the problem you are trying to solve. Linear kernels work well for linearly separable data, while non-linear kernels like RBF are suitable for more complex data distributions.
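As an illustration (a minimal sketch, not part of the original tutorial), scikit-learn's SVC selects the kernel through its kernel argument; the toy dataset below is only a placeholder.
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# toy, non-linearly separable data for demonstration only
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)

# fit the same data with each of the common kernels and compare training accuracy
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    clf = SVC(kernel=kernel, C=1.0).fit(X_demo, y_demo)
    print(kernel, clf.score(X_demo, y_demo))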
The regularization parameter (C) controls the trade-off between achieving a low training error and minimizing the norm of the weights. A higher value of C allows for more flexibility in the decision boundary, potentially leading to overfitting, while a lower value of C imposes a smoother decision boundary and may lead to underfitting.
Gamma is a parameter for non-linear hyperplanes. It defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close.’ A higher gamma value will result in more complex decision boundaries, which may lead to overfitting.
Use techniques like k-fold cross-validation to evaluate different Support Vector Machine models with various hyperparameters. Cross-validation helps in selecting the model with the best generalization performance on unseen data.
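For example, here is a hedged sketch using scikit-learn's GridSearchCV, which runs k-fold cross-validation for every combination of C, gamma, and kernel (the grid values are only illustrative):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X_cv, y_cv = load_iris(return_X_y=True)

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1],
    'kernel': ['linear', 'rbf'],
}

# 5-fold cross-validation over every parameter combination
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_cv, y_cv)
print(search.best_params_, search.best_score_)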
Consider the size and complexity of your dataset. For large datasets, linear SVMs with a linear kernel or stochastic gradient descent (SGD) SVMs are often preferred due to their computational efficiency. For smaller datasets or when dealing with non-linearly separable data, non-linear kernels like RBF may be more appropriate.
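For instance, a linear SVM trained with stochastic gradient descent (SGDClassifier with hinge loss) or LinearSVC can stand in for the kernelized SVC on large datasets; the synthetic data below is only for illustration.
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X_big, y_big = make_classification(n_samples=50000, n_features=20, random_state=0)

# hinge loss makes SGDClassifier behave like a linear soft margin SVM
sgd_svm = SGDClassifier(loss='hinge', alpha=1e-4).fit(X_big, y_big)

# LinearSVC solves the linear SVM problem directly and scales well
lin_svm = LinearSVC(C=1.0).fit(X_big, y_big)

print(sgd_svm.score(X_big, y_big), lin_svm.score(X_big, y_big))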
Understand the characteristics of your problem, such as the nature of the data distribution, the presence of noise or outliers, and the importance of interpretability versus accuracy. These factors can influence the choice of Support Vector Machine variant and its hyperparameters.
Choose a suitable library or implementation of soft margin SVM that offers flexibility, efficiency, and ease of use for your specific task. Popular libraries include scikit-learn in Python, LIBSVM, and SVMlight.
By carefully considering these factors and experimenting with different Support Vector Machine configurations, you can choose the correct SVM model that best fits your data and problem requirements.
This brings us to the discussion about Hard and Soft SVM.
I would like to again continue with the above example.
We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).
Hard SVM and Soft SVM are variations of the Support Vector Machine algorithm, differing primarily in how they handle classification errors and the margin.
In Hard SVM, the algorithm aims to find the hyperplane that separates the classes with the maximum margin while strictly enforcing that all data points are correctly classified. It assumes the data is linearly separable, i.e., that at least one hyperplane exists that can perfectly separate the classes without any misclassifications. Because Hard SVM does not tolerate any misclassification errors and demands perfectly separable data, it can be overly restrictive and may perform poorly on noisy or overlapping datasets.
Soft SVM, also known as C-SVM (C for the regularization parameter), relaxes the strict requirement of Hard SVM by allowing some misclassification errors. It introduces a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing the classification error. A smaller value of C allows for a wider margin and more misclassifications, while a larger value of C penalizes misclassifications more heavily, leading to a narrower margin. Soft SVM is suitable for cases where the data may not be perfectly separable or contains noise or outliers. It provides a more robust and flexible approach to classification, often yielding better performance in practical scenarios.
In its classic formulation, the Support Vector Machine is a hard margin classifier, which works well only if our data is linearly separable.
If our data is non-separable or nonlinear, the hard margin Support Vector Machine will not return any hyperplane, since none can separate the data. This is where soft margin SVM comes to the rescue: by adding slack variables to the primal formulation, and by using kernels such as the Gaussian (RBF) kernel through the dual problem, it handles such cases effectively.
Now that we know what the regularization parameter (C) does, we need to understand its relationship with the Support Vector Machine.
Another significant parameter of the Support Vector Machine is Gamma. It tells us how much influence individual data points have on the decision boundary.
– Large Gamma: each training point influences only a small, local region, so the decision boundary becomes highly non-linear, which leads to overfitting.
– Small Gamma: each training point's influence reaches farther, so more data points shape the decision boundary, making it smoother and more generic.
Support Vector Machine deals with nonlinear data by transforming it into a higher dimension where it is linearly separable. It does so by using different kernels. The kernel parameter offers options such as 'linear', 'rbf', 'poly', and others (the default in scikit-learn's SVC is 'rbf'). Here, 'rbf' and 'poly' are useful for non-linear decision boundaries.
From the above figure, it is clear that choosing the right kernel is very important in order to get the correct results.
For this part, I will be using the Iris dataset.
Python Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import svm
from sklearn.svm import SVC

iris = pd.read_csv("iris.csv")
X = iris[['SepalLengthCm', 'SepalWidthCm']].values  # we only take the first two features (sepal length and width)
Y = iris['Species'].values
print(iris.head())
def decision_boundary(X, y, model, res, test_idx=None):
    markers = ['s', 'o', 'x']
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    colormap = ListedColormap(colors[:len(np.unique(y))])
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, res), np.arange(y_min, y_max, res))
    z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    # map the (string) class labels to integer codes so they can be colour-mapped
    zz = np.searchsorted(model.classes_, z).reshape(xx.shape)
    plt.pcolormesh(xx, yy, zz, cmap=colormap)
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(X[y == cl, 0], X[y == cl, 1], c=colors[idx],
                    edgecolors='k', marker=markers[idx], label=cl, alpha=0.8)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3)
scaler=StandardScaler()
scaler.fit(X_train)
X_train_new=scaler.transform(X_train)
X_test_new=scaler.transform(X_test)
model = SVC(C=10**10)  # Hard SVM (a very large C approximates a hard margin)
model.fit(X_train, y_train)
decision_boundary(np.vstack((X_train,X_test)),np.hstack((y_train,y_test)),model,0.08,test_idx=None)
plt.xlabel('sepal length ')
plt.ylabel('sepal width ')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
model=SVC(C=100) # Soft SVM
model.fit(X_train,y_train)
decision_boundary(np.vstack((X_train,X_test)),np.hstack((y_train,y_test)),model,0.08,test_idx=None)
plt.xlabel('sepal length ')
plt.ylabel('sepal width ')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
We can clearly see that Soft SVM allows for some misclassification, unlike Hard SVM.
plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=1.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=1.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=10.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=10.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=100.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=100.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
From the above plots, we can see that when we increase the value of Gamma the decision boundary becomes non-linear and leads to over-fitting.
It is generally preferred to keep Gamma value small in order to have a more ‘Generalized Model’.
For this part, I have created helper functions for building the mesh grid and drawing the decision boundary in sub-plots.
def create_mesh(x, y, res=0.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, res), np.arange(y_min, y_max, res))
    return xx, yy

def create_contours(ax, clf, xx, yy, **parameters):
    z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # map class labels to integer codes before contouring
    zz = np.searchsorted(clf.classes_, z).reshape(xx.shape)
    out = ax.contourf(xx, yy, zz, **parameters)
    return out
## Creating the sub-plots
models = (SVC(kernel='linear', C=1.0),
          SVC(kernel='rbf', C=1.0),
          SVC(kernel='linear', C=10**10),
          SVC(kernel='rbf', C=10**10))
models = [clf.fit(X_train, y_train) for clf in models]
# titles for the plots
titles = ('Soft SVC with linear kernel',
          'Soft SVC with rbf kernel',
          'Hard SVC with linear kernel',
          'Hard SVC with rbf kernel')
# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2,figsize=(10,10))
plt.subplots_adjust(wspace=0.4, hspace=0.4)
xx,yy=create_mesh(X[:,0], X[:,1])
for clf, title, ax in zip(models, titles, sub.flatten()):
    markers = ['s', 'o', 'x']
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    colormap = ListedColormap(colors[:len(np.unique(Y))])
    create_contours(ax, clf, xx, yy, cmap=colormap)
    for idx, cl in enumerate(np.unique(Y)):
        ax.scatter(X[Y == cl, 0], X[Y == cl, 1], c=colors[idx],
                   edgecolors='k', marker=markers[idx], label=cl, alpha=0.8)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
plt.show()
Support Vector Machine (SVM) stands as a powerful tool in data science, adept at tackling classification and regression challenges. This tutorial demystified soft margin SVM, from fundamentals to Python implementation. We explored SVM’s optimization intricacies, crucial in balancing margin maximization and misclassification minimization, especially in binary classification. Understanding regularization parameters and kernel selection fine-tunes SVM models for optimal performance. By contrasting hard and soft SVM, we grasped SVM’s adaptability to varying data complexities. Emphasizing experimentation and understanding problem characteristics highlighted the importance of selecting the right SVM model. Armed with this knowledge, you can harness SVM’s power, making informed decisions in machine learning.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
A. Hard margin SVM aims for perfect separation without misclassification, suitable only for linearly separable data. Soft margin SVM allows some misclassification, controlled by a regularization parameter (C), leading to a wider margin and better generalization.
A. In hard margin SVM classification, we prefer a larger margin because it allows for a more robust and generalizable model. A larger margin indicates greater separation between classes, reducing the risk of overfitting and improving the model’s ability to classify unseen data accurately.
A. The optimization problem in Support Vector Machine involves finding the hyperplane that maximizes the margin between different classes while minimizing the classification error. Typically, this is solved as a convex optimization problem using techniques like quadratic programming. The objective is to find the optimal hyperplane that best separates the classes with the maximum margin, ensuring robustness and generalization to unseen data.
A. A binary classifier in SVM refers to the nature of the classification task, where the algorithm distinguishes between two classes or categories. SVM inherently functions as a binary classifier, meaning it is designed to handle problems with two classes. However, techniques like one-vs-all or one-vs-one can extend SVM to multi-class classification tasks by combining multiple binary classifiers.
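As a brief sketch (not from the original article), scikit-learn's SVC applies a one-vs-one scheme internally when there are more than two classes, and a one-vs-rest wrapper is also available:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import load_iris

X_iris, y_iris = load_iris(return_X_y=True)

# SVC trains one-vs-one binary classifiers internally for 3+ classes
ovo_clf = SVC(kernel='rbf', decision_function_shape='ovo').fit(X_iris, y_iris)

# OneVsRestClassifier trains one binary SVM per class instead
ovr_clf = OneVsRestClassifier(SVC(kernel='rbf')).fit(X_iris, y_iris)

print(ovo_clf.predict(X_iris[:3]), ovr_clf.predict(X_iris[:3]))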
A. In SVM, the loss function is often referred to as the hinge loss function. It quantifies the loss incurred by the model for misclassifying data points. The hinge loss function encourages the correct classification of training examples while penalizing misclassifications.
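As a quick illustrative sketch, the hinge loss for a single example is max(0, 1 − y·f(x)), where f(x) is the decision-function value:
import numpy as np

def hinge_loss(y_true, decision_value):
    # zero loss beyond the margin, linearly increasing loss otherwise
    return np.maximum(0.0, 1.0 - y_true * decision_value)

print(hinge_loss(+1, 2.3))   # correctly classified beyond the margin -> 0.0
print(hinge_loss(+1, 0.4))   # correct side but inside the margin -> 0.6
print(hinge_loss(-1, 0.4))   # misclassified -> 1.4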