With all the attention that feature selection receives in machine learning and data science today, you might wonder why you should care about it. The answer is that most machine learning models require a large amount of training data relative to the number of features. If you don’t have enough data, you will have difficulty training the model. In addition, having too many features makes the model likely to overfit. Overfitting occurs when a model learns the noise in the training data instead of the underlying pattern. It is therefore essential to train our models on a limited number of the most significant features. This is where the concept of ‘Feature Selection’ comes into the picture.
Let us start by answering the basic question, ‘What is Feature Selection?’
This article was published as a part of the Data Science Blogathon.
Feature selection reduces the number of input variables to your model by keeping only the relevant features and getting rid of the noisy ones.
The criterion for choosing the features depends on the purpose of performing feature selection. Given the data and the number of features, we need to find the set of features that best satisfies the criteria. Ideally, the best subset would be the one that gives the best performance.
In practice, the data that we use for our machine learning and data science applications has many drawbacks.
Therefore, choosing and feeding the machine learning model with only optimal features that best influence the target variable is crucial.
Having understood why it is important to include the feature selection process while building machine learning models, let us see what problems are faced during the process.
Feature selection can be done using numerous methods. The three main types of feature selection techniques are filter methods, wrapper methods, and embedded methods.
Let us look into each of these methods in detail. There are generally two phases in filter and wrapper methods: the feature selection phase (Phase 1) and the feature evaluation phase (Phase 2).
Filter methods select features using information, distance, or correlation measures. The feature subset is generally chosen with a statistical measure such as the Chi-square test, the ANOVA test, or a correlation coefficient. These measures help in selecting the attributes that are highly correlated with the target variable. Here, we work on the same model by changing the features.
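To make the idea concrete, here is a minimal sketch of a correlation-based filter. It assumes NumPy and a small synthetic dataset in which only features 0 and 3 influence the target; the function name `select_by_correlation` and the data are hypothetical and chosen purely for illustration. Note that no model is trained: the features are ranked by a statistic alone.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Return the indices of the k features with the highest |Pearson r| against y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Score every feature independently of any model.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return sorted(top.tolist()), scores

# Synthetic data (an assumption for illustration): 5 features, 2 of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected, scores = select_by_correlation(X, y, k=2)
print("selected features:", selected)
print("scores:", np.round(scores, 3))
```

The same pattern applies to other filter statistics, such as the Chi-square or ANOVA tests mentioned above: compute one score per feature, then keep the top-ranked ones.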
Why should you choose the filter method?
In wrapper methods, we train a new model for each feature subset that is generated. The performance of each model is recorded, and the features that produce the best-performing model are used for training and testing the final algorithm. Unlike filter methods, which use distance or information-based measures for feature selection, wrapper methods use a few simple search techniques for choosing the most significant attributes. They are:
(1) Forward Selection
It is an iterative greedy process where you start with no features at all and, in each iteration, add the single most significant remaining feature. Here, the variables are added in decreasing order of their correlation with the target variable.
New attributes keep being added until the model’s performance stops improving, that is, until you reach the point where adding further features no longer helps.
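The greedy loop described above can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, an ordinary least-squares model, validation mean squared error as the performance measure, and an absolute stopping threshold `min_gain`. All of these, along with the helper names and the synthetic data, are hypothetical choices and not part of the original description.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, features):
    """Fit least squares on the chosen columns and return the validation MSE."""
    def design(X):
        # Intercept column followed by the chosen feature columns.
        return np.column_stack([np.ones(len(X))] + [X[:, j] for j in features])
    coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    residuals = y_val - design(X_val) @ coef
    return float(np.mean(residuals ** 2))

def forward_selection(data, n_features, min_gain=0.01):
    """Greedily add the feature that lowers validation MSE the most."""
    selected, remaining = [], list(range(n_features))
    best = validation_mse(*data, selected)  # intercept-only baseline
    while remaining:
        scores = {j: validation_mse(*data, selected + [j]) for j in remaining}
        candidate = min(scores, key=scores.get)
        if best - scores[candidate] < min_gain:
            break  # no remaining feature improves the model enough
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best

# Synthetic data (an assumption): only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)
data = (X[:200], y[:200], X[200:], y[200:])  # train / validation split

selected, best_mse = forward_selection(data, n_features=5)
print("selected features:", selected)
```

Scoring each candidate on held-out data rather than on the training set is a design choice here: training error never gets worse when a feature is added, so it cannot tell the loop when to stop.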
(2) Backward Elimination
As the name suggests, here we start with all the features present in the dataset, and with each iteration, we remove the least significant variable.
We keep removing attributes until eliminating further features no longer improves the model’s performance. The feature least correlated with the target variable is identified using certain statistical measures. In contrast to forward selection, the features are removed in increasing order of their correlation with the target variable.
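A minimal sketch of backward elimination, under assumptions similar to those one might use for forward selection: NumPy, a least-squares model, and validation mean squared error as the performance measure. The `tolerance` parameter, the helper names, and the synthetic data are hypothetical. In this version a feature is dropped as long as removing it does not worsen the validation error by more than the tolerance.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, features):
    """Fit least squares on the chosen columns and return the validation MSE."""
    def design(X):
        # Intercept column followed by the chosen feature columns.
        return np.column_stack([np.ones(len(X))] + [X[:, j] for j in features])
    coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    residuals = y_val - design(X_val) @ coef
    return float(np.mean(residuals ** 2))

def backward_elimination(data, n_features, tolerance=0.01):
    """Start from all features and greedily drop the least useful one."""
    selected = list(range(n_features))
    best = validation_mse(*data, selected)
    while len(selected) > 1:
        # Validation error after dropping each remaining feature in turn.
        scores = {j: validation_mse(*data, [f for f in selected if f != j])
                  for j in selected}
        candidate = min(scores, key=scores.get)
        if scores[candidate] > best + tolerance:
            break  # every possible removal hurts the model noticeably
        selected.remove(candidate)
        best = scores[candidate]
    return selected, best

# Synthetic data (an assumption): only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)
data = (X[:200], y[:200], X[200:], y[200:])  # train / validation split

selected, best_mse = backward_elimination(data, n_features=5)
print("remaining features:", selected)
```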
It is also possible to combine both these methods. This is often called Bidirectional Elimination. It is similar to forward selection, with one difference: if a previously added feature turns out to be insignificant at a later stage, after a new feature has been added, it is removed through backward elimination.
It is worth noting that wrapper methods may work very effectively for certain learning algorithms. However, their computational costs are much higher than those of filter methods, because a new model has to be trained for every candidate subset.
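In embedded methods, feature selection is performed as part of the model’s training itself rather than as a separate search over feature subsets. The learning algorithm weighs or penalizes the features while it fits, for example through L1 (Lasso) regularization, which shrinks the coefficients of unhelpful features to zero, or through the feature importances of tree-based models. The features that the trained model retains are then used for the final training. Generating every combination of features and training a model on each, by contrast, is an exhaustive search, which belongs with the wrapper methods.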
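In practice, embedded selection is often illustrated with L1 (Lasso) regularization, where the fitting procedure itself drives some coefficients to exactly zero. The following is a minimal sketch under assumptions: it implements Lasso with a plain coordinate-descent loop in NumPy, on standardized synthetic data, with a penalty strength `alpha` picked by hand. The function names and the data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Synthetic data (an assumption): only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Standardize the features and center the target so no intercept is needed.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

weights = lasso_coordinate_descent(X_std, y_centered, alpha=0.1)
selected = [j for j in range(X.shape[1]) if weights[j] != 0.0]
print("weights:", np.round(weights, 3))
print("selected features:", selected)
```

Selection and training happen in one pass here: the features whose weights survive the penalty are the selected ones, and a larger `alpha` removes more of them.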
The choice of technique used for feature selection depends on the application and the dataset’s size and requires an in-depth understanding of the dataset. As mentioned before,
With this, we conclude our discussion of feature selection. To summarize, we began by defining feature selection and understanding its significance. Later on, we looked at the problems encountered while performing it and at how knowing the different attribute selection methods can help us overcome those problems.
The main takeaways from this article are:
I hope you liked my article. If you have any opinions or questions, then comment below.
Connect with me on LinkedIn for further discussion.