This article was published as a part of the Data Science Blogathon.
Feature analysis is an important step in building any predictive model. It helps us understand the relationship between the dependent and independent variables. In this article, we will look at a very simple feature analysis technique that can be used in cases such as binary classification problems. The underlying idea is to quantify the relationship between each independent variable and the dependent variable at certain levels of values for each variable. This can help in identifying a subset of variables that are more important, and in learning which levels of a particular feature's values matter most.
This knowledge about features can help us decide which features to include in the initial model. It can also help reduce the overall complexity of the predictive model by binning continuous numerical variables into categorical ones. In particular, we will look at a supervised feature analysis approach, also known as bivariate feature analysis.
Wine Quality Dataset –
The dataset used in this article is publicly available from the UCI Machine Learning Repository.
Link: https://archive-beta.ics.uci.edu/ml/datasets/wine+quality
Attributes/Features List
Source: Author
Output (Target) variable: quality (score between 0 and 10)
Let’s look at the distribution of the target variable:
Source: Author
A higher score means better quality of the wine.
Let’s consider scores of 8 & 9 as an excellent quality group and the rest of the scores as a non-excellent quality group.
Source: Author
Thus, our problem statement now is to predict the quality of wine as Excellent or Non-Excellent based on the available features.
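The grouping described above can be sketched in a few lines of pandas. This is only an illustration on a tiny made-up frame: the name `wine_data` is hypothetical, while the `quality` and `target` column names follow the ones used in this article.

```python
import pandas as pd

# Tiny stand-in for the wine data; only the `quality` column matters here.
wine_data = pd.DataFrame({"quality": [5, 6, 8, 7, 9, 6, 3, 8]})

# Scores of 8 and 9 form the Excellent group (1); all other scores are Non-Excellent (0).
wine_data["target"] = wine_data["quality"].isin([8, 9]).astype(int)

# Class counts and the overall share of Excellent wines.
print(wine_data["target"].value_counts())
print(f"Excellent share: {wine_data['target'].mean():.1%}")
```

On the real dataset the same two lines would show how few rows fall into the Excellent class.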
Note that the classes are imbalanced here: we have very few examples of Excellent quality wines compared to Non-Excellent ones. Such situations are often encountered in problems such as credit card fraud detection, insurance claim fraud identification, or disease detection. Identifying the most relevant features in these situations can help businesses prioritize and focus on what matters, which can improve the predictive power of the underlying model while keeping it simple.
We will use feature analysis methods to see which of these variables are relevant for identifying excellent quality wine.
We will follow a supervised feature analysis approach. In particular, we will use the target variable along with independent variables to check their relationships.
Let’s split the data into train and test sets.
After splitting the dataset into train & test sets the target distribution will look as below,
Source: Author
To reproduce the same numbers as above, refer to the data exploration code below and set random_state=100.
In this dataset, all the features are numeric and continuous. To understand how they are related to the target variable, we need to look at buckets (levels) of each variable's values.
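As a rough sketch of a random split with a fixed seed, the snippet below uses pandas' `DataFrame.sample`. The 70/30 ratio, the synthetic data, and the use of `sample` rather than any particular splitting utility are all assumptions, so this will not necessarily reproduce the numbers shown in this article; only `random_state=100` and the names `trn_wine_data` and `tst_wine_data` come from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (values are made up for illustration).
rng = np.random.default_rng(0)
wine_data = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, size=200),
    "target": rng.binomial(1, 0.1, size=200),
})

# Random 70/30 split; the 70% fraction is an assumption, random_state=100 is from the text.
trn_wine_data = wine_data.sample(frac=0.7, random_state=100)
tst_wine_data = wine_data.drop(trn_wine_data.index)

# With plain random sampling the two target rates should be close, not identical.
print(f"train target rate: {trn_wine_data['target'].mean():.3f}")
print(f"test target rate:  {tst_wine_data['target'].mean():.3f}")
```

Because the split is purely random rather than stratified, the two rates can drift apart on small or heavily imbalanced data.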
The helper function below will compute several statistics in order to analyse the relationship between each bucket of values and the target variable.
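The author's helper is not reproduced in this text, so the following is only a minimal sketch of what such a function might look like. It matches the call signature used in this article (`y`, `features`, `trn_df`, `tst_df`, `y_hat`); everything else is an assumption, including the quantile bucketing, the extra `n_levels` parameter, and the output column names.

```python
import numpy as np
import pandas as pd


def feature_analysis(y, features, trn_df, tst_df, y_hat=None, n_levels=10):
    """Per-level target statistics for each feature (illustrative sketch).

    Levels are quantile buckets learned on the train split. `n_levels` is an
    assumed parameter, not part of the call shown in the article.
    """
    rows = []
    for feature in features:
        # Bucket edges come from the train data only, then are reused on test.
        _, edges = pd.qcut(trn_df[feature], q=n_levels, retbins=True, duplicates="drop")
        edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the train range
        for split_name, df in (("train", trn_df), ("test", tst_df)):
            levels = pd.cut(df[feature], bins=edges)
            grouped = df.groupby(levels, observed=False)
            stats = pd.DataFrame({
                "n_rows": grouped[y].size(),
                "n_target": grouped[y].sum(),
                "target_pct": 100 * grouped[y].mean(),
            })
            if y_hat is not None:
                # Optional column of model predictions, averaged per level.
                stats["mean_y_hat"] = grouped[y_hat].mean()
            # Benchmark: target percentage of the whole split.
            stats["overall_target_pct"] = 100 * df[y].mean()
            stats.index.name = "level"
            stats = stats.reset_index()
            stats["level"] = stats["level"].astype(str)
            stats.insert(0, "split", split_name)
            stats.insert(0, "feature", feature)
            rows.append(stats)
    return pd.concat(rows, ignore_index=True)


# Made-up demo data in which higher alcohol makes the target more likely.
rng = np.random.default_rng(100)
demo = pd.DataFrame({"alcohol": rng.uniform(8, 14, 500)})
demo["target"] = (rng.uniform(size=500) < (demo["alcohol"] - 8) / 30).astype(int)
trn_wine_data, tst_wine_data = demo.iloc[:350], demo.iloc[350:]

feature_analysis_df = feature_analysis(
    y="target", features=["alcohol"], trn_df=trn_wine_data, tst_df=tst_wine_data, y_hat=None
)
print(feature_analysis_df.head())
```

Learning the bucket edges on the train split and reusing them on the test split keeps the levels comparable across the two.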
Using the feature_analysis helper function above we will get a feature analysis dataframe.
all_features_list = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
feature_analysis_df = feature_analysis(y='target', features=all_features_list, trn_df=trn_wine_data, tst_df=tst_wine_data, y_hat=None)
We can see that each feature is broken down into various levels.
Table 1: Feature analysis of feature – fixed acidity
Source: Author
Since we randomly sampled the train and test datasets, the overall target percentages of the two splits are nearly the same.
The overall target percentage is like a benchmark against which each of the levels can be compared.
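One way to make the benchmark comparison concrete is to compute, for each level, the difference from and the ratio to the overall target percentage. The sketch below uses made-up rows, and the column names `diff_vs_overall` and `ratio_to_overall` are hypothetical, not taken from the article's tables.

```python
import pandas as pd

# Made-up rows for illustration only: a bucketed feature and a binary target.
df = pd.DataFrame({
    "level":  ["low", "low", "low", "low", "mid", "mid", "mid", "mid", "high", "high"],
    "target": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Benchmark: target percentage over all rows (3 of 10 rows here).
overall_pct = 100 * df["target"].mean()

# Target percentage within each level.
level_pct = 100 * df.groupby("level")["target"].mean()

comparison = pd.DataFrame({
    "target_pct": level_pct,
    "diff_vs_overall": level_pct - overall_pct,
    "ratio_to_overall": level_pct / overall_pct,
})
print(comparison)
```

A ratio well above 1 marks a level where the target is over-represented relative to the benchmark, and a ratio well below 1 marks the opposite.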
The objective of this analysis is to know,
Let’s look at each feature individually,
For feature fixed acidity we see that (Refer Table 1),
For feature alcohol, we see that
Table 2: Feature analysis of feature – alcohol
Source: Author
For feature sulphates, we see that
Table 3: Feature analysis for feature – sulphates
Source: Author
For feature pH, we see that
Table 4: Feature analysis for feature – pH
Source: Author
For feature volatile acidity, we see that
Table 5: Feature analysis for feature – volatile acidity
Source: Author
For feature citric acid, we see that
Table 6: Feature analysis for feature – citric acid
Source: Author
For feature residual sugar, we see that
Table 7: Feature analysis for feature – residual sugar
Source: Author
For feature chlorides, we see that
Table 8: Feature analysis for feature – chlorides
Source: Author
For feature free sulfur dioxide, we see that
Table 9: Feature analysis for feature – free sulfur dioxide
Source: Author
For feature total sulfur dioxide, we see that
Table 10: Feature analysis for feature – total sulfur dioxide
Source: Author
For feature density, we see that
Table 11: Feature analysis for feature – density
Source: Author
Great, we covered all the features!
The above analysis is very simple, but it gives us a useful first overview of the individual features and of their important levels.
In particular, comparing the target percentage in each level of a feature's values gave us a lot of insight. It helped us identify important features, such as alcohol at high values, and important levels of certain features, such as the initial levels of values for chlorides and sulphates.
This analysis can further help in compressing the levels and binning each feature so that only the important levels are kept. This converts the continuous feature into a categorical one and reduces the overall complexity of the model.
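As a hedged illustration of this binning step, `pd.cut` can collapse a continuous feature into a handful of named levels. The cut points and labels below are placeholders chosen for the example, not findings from the analysis above.

```python
import pandas as pd

# Made-up alcohol values; the cut points below are placeholders, not results from the article.
alcohol = pd.Series([8.5, 9.4, 10.1, 11.0, 12.3, 13.1], name="alcohol")

bin_edges = [-float("inf"), 10.0, 12.0, float("inf")]
bin_labels = ["low", "medium", "high"]

# Each value is mapped to the level whose interval contains it.
alcohol_binned = pd.cut(alcohol, bins=bin_edges, labels=bin_labels)

# One-hot encode the levels so a model sees three indicator columns
# instead of one continuous value.
alcohol_dummies = pd.get_dummies(alcohol_binned, prefix="alcohol")
print(alcohol_dummies)
```

In practice the edges would be chosen from the levels whose target percentage differs most from the overall benchmark.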
Hope you will find this feature analysis technique useful in your work!