This article was published as a part of the Data Science Blogathon
What is a feature, and why do we need it engineered? In general, all machine learning algorithms use some form of input data to generate outputs. This input data consists of features, usually in the form of structured columns. Algorithms require features with specific characteristics to work well. This is where the need for feature engineering arises.
I believe that feature engineering efforts are primarily motivated by two objectives: preparing an input dataset that is compatible with the requirements of the machine learning algorithm, and improving the performance of machine learning models.
The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
— Luca Massaron
According to a Forbes survey, data scientists spend 80% of their time on data preparation.
This metric demonstrates the significance of feature engineering in data science. As a result, I decided to write this article, which summarises the main techniques of feature engineering and provides brief descriptions of each.
I also included some simple Python scripts for each technique. To use them, you must first import the Pandas and Numpy libraries.
Some of the techniques listed below may work better with specific algorithms or datasets, while others may be helpful in all cases. This post does not intend to delve too deeply into the topic; a separate post could be written for each of the methods listed below, so I have attempted to keep the explanations brief and informative.
Practising different techniques on different datasets and observing their effect on model performance is the best way to gain expertise in feature engineering.
Missing values are one of the most common issues that arise when attempting to prepare data for machine learning. Human errors, interruptions in the data flow, privacy concerns, and other factors could be the reason for missing values. Missing values, for whatever reason, have an impact on the performance of machine learning models.
Some machine learning platforms automatically drop rows with missing values during the model training phase, which reduces model performance due to the reduced training size. On the other hand, most algorithms reject datasets with missing values and return an error.
The most straightforward way to deal with missing values is to drop the rows or the entire columns. There is no optimal dropping threshold, but you can take 80% as an example and drop the rows and columns whose share of missing values is greater than that proportion.
threshold_value = 0.8
#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold_value]]

#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold_value]
Imputation is preferable to dropping because it retains the data size. However, what you replace the missing values with is an important choice. I recommend starting by contemplating a suitable default value for the missing values in the field. For example, if you have a column that only contains 1 and NA, the NA rows likely correspond to 0. Similarly, if you have a column that shows the “customer visit count in the last month,” replacing the missing values with 0 is a reasonable option.
Another cause of missing values is combining tables of different sizes; in this situation, imputing 0 may be appropriate as well.
#Filling all missing values with 0
data = data.fillna(0)

#Filling missing values with the medians of the columns
data = data.fillna(data.median())
To handle categorical variables, replacing missing values with the mode (the most frequent value) of the column is a good choice. If there is no dominant value and the values are distributed uniformly, imputing a category like “unknown” is more sensible, because in that case mode imputation would amount to a random selection.
#Max fill function for categorical columns
data['column_name'].fillna(data['column_name'].value_counts().idxmax(), inplace=True)
Binning can be applied to both numerical and categorical data.
#Numerical Bin example
Value Bin
0-30 -> Low
31-70 -> Med
71-100 -> High
#Categorical Bin example
Value Bin
Spain -> Europe
Italy -> Europe
Chile -> South America
The main reason for binning is to make the model more robust and to prevent overfitting; however, it comes at a cost in terms of performance. Every time you throw something away, you give up information and make your data more regular. (For more information, see regularisation in machine learning.)
The key point of the binning process is the trade-off between performance and overfitting. In my opinion, binning may be redundant for numerical columns with some types of algorithms, except in obvious overfitting cases, because of its cost to model performance.
However, for categorical columns, labels with low frequencies are likely to harm the robustness of statistical models. Assigning a general category to these less frequent values therefore contributes to the model’s robustness. For example, if your dataset contains 10,000 rows, it might be a good idea to group labels with a count of less than 100 into a new category such as “Other” (see the sketch after the binning example below).
#Numerical Binning Example
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Low", "Mid", "High"])

   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low

#Categorical Binning Example
     Country
0      Spain
1      Chile
2  Australia
3      Italy
4     Brazil

conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]

choices = ['Europe', 'Europe', 'South America', 'South America']

data['Continent'] = np.select(conditions, choices, default='Other')

     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America
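As a minimal sketch of the rare-label grouping idea described above, assuming an illustrative 'Country' column and a count threshold of 100:

#Grouping labels with a count of less than 100 into "Other" (illustrative threshold)
counts = data['Country'].value_counts()
rare_labels = counts[counts < 100].index
data['Country'] = data['Country'].where(~data['Country'].isin(rare_labels), 'Other')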
Before discussing how to handle outliers, I’d like to point out that visualising the data is the best way to detect outliers. All other statistical methodologies are prone to error, whereas visualising outliers allows for a more precise decision.
Statistical methodologies, as previously stated, are less precise, but they have the advantage of being fast. In this section, I will discuss two approaches to detecting and dropping outliers: one based on the standard deviation and one based on percentiles.
If a value’s distance from the average is more than x * standard deviation, it is considered an outlier. So, what should x be?
There is no simple solution for x, but a value between 2 and 4 seems reasonable.
#Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
Furthermore, the z-score can be used instead of the formula above. The z-score (or standard score) standardises the distance between a value and the mean by dividing it by the standard deviation.
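As a minimal sketch of the z-score alternative, assuming the same hypothetical 'column' and a factor of 3:

#Dropping the outlier rows with the z-score
factor = 3
z_scores = (data['column'] - data['column'].mean()) / data['column'].std()

data = data[z_scores.abs() < factor]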
The use of percentiles is another mathematical method for detecting outliers. You can treat a certain percentage of values from the top or the bottom as outliers. The key point here is setting the percentage value appropriately, and that depends on the distribution of your data, as mentioned earlier.
Furthermore, a common mistake is to compute the percentiles from the range of the data. In other words, if your data ranges from 0 to 100, the values between 96 and 100 are not necessarily your top 5%. The top 5% are the values that lie above the 95th percentile of the data.
#Dropping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
The logarithm transformation (or log transform) is one of the most common mathematical transformations in feature engineering. Its main advantages are that it helps to handle skewed data, bringing the distribution closer to normal, and that it decreases the effect of outliers, making the model more robust.
Note: The data you apply a log transform to must contain only positive values; otherwise you will receive an error. You can also add 1 to your data before transforming it, which ensures that the transformation’s output is positive.
#Log Transform Example
data = pd.DataFrame({'value':[2, 45, -23, 85, 28, 2, 35, -12]})

data['log+1'] = (data['value'] + 1).transform(np.log)

#Negative Values Handling
#Note that the values are different
data['log'] = (data['value'] - data['value'].min() + 1).transform(np.log)

   value  log(x+1)  log(x-min(x)+1)
0      2   1.09861          3.25810
1     45   3.82864          4.23411
2    -23       nan          0.00000
3     85   4.45435          4.69135
4     28   3.36730          3.95124
5      2   1.09861          3.25810
6     35   3.58352          4.07754
7    -12       nan          2.48491
One-hot encoding is one of the most common encoding methods in machine learning. It spreads the values in a single column across multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped column and the encoded columns.
Categorical data is challenging for algorithms to understand. This encoding converts it to a numerical format and makes it possible to group the categorical data without losing any information.
If you have N unique values in the column, it is enough to map them to N-1 binary columns, because the remaining value can be deduced from the other columns: if all the encoded columns are 0, the dropped value must be the one that applies. (The name “one-hot” refers to the fact that at most one of the flag columns is set to 1 for each row.)
Here’s an example using the get_dummies function of pandas, which maps all the values of a column to multiple flag columns.
encoded = pd.get_dummies(data['column'])
data = data.join(encoded).drop('column', axis=1)
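If you prefer the N-1 column mapping described above, get_dummies can drop the first category; this is only a small variation on the example:

#Mapping N unique values to N-1 binary columns
encoded = pd.get_dummies(data['column'], drop_first=True)
data = data.join(encoded).drop('column', axis=1)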
Splitting features is a good way to make them more valuable for machine learning. Datasets almost always contain string columns that violate tidy data principles. By isolating the informative parts of such a column and transforming them into new features, you make them easier for machine learning algorithms to comprehend and can uncover information that improves model performance.
Splitting features is a smart choice, but there is no one-size-fits-all solution. How to split the column is determined by the column’s attributes. Let’s start with a couple of examples. For starters, here’s a simple split method for a regular name column:
data.name
0    Luther N. Gonzalez
1      Charles M. Young
2          Terry Lawson
3          Taylor White
4        Thomas Logsdon

#Extracting first names
data.name.str.split(" ").map(lambda x: x[0])
0     Luther
1    Charles
2      Terry
3     Taylor
4     Thomas

#Extracting last names
data.name.str.split(" ").map(lambda x: x[-1])
0    Gonzalez
1       Young
2      Lawson
3       White
4     Logsdon
The example above handles names longer than two words by taking only the first and last elements, which makes it robust to the corner cases you should consider when manipulating strings like this.
The split method is also helpful for extracting a string part that lies between two characters. The following example uses two split functions in a row to handle such a case.
#String extraction example
data.title.head()
0                       Toy Story (1995)
1                         Jumanji (1995)
2                Grumpier Old Men (1995)
3               Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)

data.title.str.split("(", n=1, expand=True)[1].str.split(")", n=1, expand=True)[0]
0    1995
1    1995
2    1995
3    1995
4    1995
Every row represents an instance, and the columns hold the different features of each instance; this kind of data is known as tidy data.
Grouping operations aggregate the data by instance so that each instance is represented by only one row.
The key point of a group by operation is deciding the aggregation function for each feature. Averaging and summing are usually convenient options for numerical features, whereas aggregation is more complicated for categorical data.
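For numerical features, such an aggregation might look like the following sketch; the 'id', 'amount', and 'visit_count' columns are hypothetical:

#Aggregating numerical columns with sum and mean (hypothetical columns)
data.groupby('id').agg({'amount': 'sum', 'visit_count': 'mean'})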
I suggest two ways of aggregating categorical columns:
The first option is to choose the label with the highest frequency. In other words, this is the max operation for categorical columns, but ordinary max functions rarely return this value; instead, a lambda function is required.
data.groupby('id').agg(lambda x: x.value_counts().index[0])
The second alternative is to apply a group by function after performing one-hot encoding. This technique keeps all of the data and, at the same time, converts the encoded column from categorical to numerical.
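A minimal sketch of this second option, assuming a hypothetical 'id' key and a categorical 'column' encoded as in the previous section:

#One-hot encoding the categorical column, then summing the flag columns per id
encoded = pd.get_dummies(data['column'])
data = data.join(encoded).drop('column', axis=1)
category_counts = data.groupby('id')[encoded.columns].sum()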
The numerical properties of the dataset, in most circumstances, do not have a fixed range and differ from one another. In reality, expecting the age and income columns to have the same range is absurd. But how can these two columns be compared from the standpoint of machine learning?
This issue is solved by scaling. After a scaling operation, the continuous features become similar in terms of range. Although this step is not a must for many algorithms, it’s still a good idea to do so. Distance-based algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as model input.
All values are scaled in a specified range between 0 and 1 via normalisation (or min-max normalisation). This modification does not influence the feature’s distribution, but it does exacerbate the effects of outliers due to lower standard deviations. As a result, it’s a good idea to deal with outliers before normalisation.
data = pd.DataFrame({'feature':[2, 45, -23, 85, 28, 2, 35, -12]})

data['normalized'] = (data['feature'] - data['feature'].min()) / (data['feature'].max() - data['feature'].min())

   feature  normalized
0        2        0.23
1       45        0.63
2      -23        0.00
3       85        1.00
4       28        0.47
5        2        0.23
6       35        0.54
7      -12        0.10
Standardisation (also known as z-score normalisation) scales the values while taking the standard deviation into account. If the standard deviations of the features differ, their scaled ranges will also differ from each other. This reduces the effect of outliers in the features.
data = pd.DataFrame({'feature':[2, 45, -23, 85, 28, 2, 35, -12]})

data['standardized'] = (data['feature'] - data['feature'].mean()) / data['feature'].std()

   feature  standardized
0        2         -0.52
1       45          0.70
2      -23         -1.23
3       85          1.84
4       28          0.22
5        2         -0.52
6       35          0.42
7      -12         -0.92
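As an alternative to the manual formulas above, scikit-learn provides ready-made scalers; here is a minimal sketch, assuming scikit-learn is installed and using new illustrative column names (note that StandardScaler uses the population standard deviation, so its output differs slightly from the pandas .std() example above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

#Min-max normalisation with scikit-learn
data['normalized_sk'] = MinMaxScaler().fit_transform(data[['feature']]).ravel()

#Standardisation with scikit-learn (population standard deviation)
data['standardized_sk'] = StandardScaler().fit_transform(data[['feature']]).ravel()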
Even though date columns typically give helpful information about the model goal, they are either ignored as an input or used in an illogical manner by machine learning algorithms. This may be because dates come in a variety of formats, making them difficult for algorithms to interpret, even when simplified to a format like “01-01-2020.”
If you don’t manipulate the date columns, it’s very difficult for a machine learning model to build an ordinal relationship between the values. Here are three forms of date preparation that I recommend: extracting the parts of the date into separate columns (year, month, day, and so on); extracting the time period between the current date and the date column in terms of years, months, and so on; and extracting specific features from the date, such as the name of the weekday.
When you convert the date column into extracted columns like these, the information contained within them is revealed, and machine learning algorithms can comprehend it easily.
from datetime import date
data = pd.DataFrame({'date':['01-01-2017','04-12-2008','23-06-2010','25-08-2005','20-02-2020',]})
#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")
#Extracting Year
data['year'] = data['date'].dt.year
#Extracting Month
data['month'] = data['date'].dt.month
#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year
#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month
#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()
#The resulting frame contains the columns: date, year, month, passed_years, passed_months, day_name
These techniques aren’t magic; experiment with them to extract the key information from your features and improve the model’s performance.
I hope you’ve found this article useful and that it helps you in your feature engineering process.
Q1. What is binning in feature engineering?
Binning in feature engineering is like sorting data into groups to make it easier for models to handle.

Q2. What is feature engineering in image processing?
In image processing, feature engineering is about helping computers recognize important things in pictures, like edges, shapes, and colors. It’s like teaching computers to understand what’s in the images.

Q3. Are there tools that automate feature engineering?
Yes, there are tools like Featuretools and TPOT that make feature engineering faster and easier.