In Natural Language Processing (NLP), feature extraction is one of the most important steps for capturing the context of the text we are working with. After the raw text is cleaned, it must be transformed into features that can be used for modeling. Document data is not directly computable, so it has to be converted into numerical data such as a vector space model. This transformation is generally called feature extraction from document data; it is also known as text representation or text vectorization.
In this article, we will explore different feature extraction techniques such as Bag of Words, TF-IDF, n-grams, and Word2Vec. Let's get started.
First, let us answer a few basic questions:
1. What is Feature Extraction from the text?
2. Why do we need it?
3. Why is it difficult?
4. What are the techniques?
1. What is Feature Extraction from the text?
Textual data cannot be fed directly into a machine learning algorithm, because such algorithms understand only numerical data. The process of converting text data into numbers is called feature extraction from text; it is also called text vectorization.
2. Why do we Need it?
Machines understand only numbers, so to make a machine work with language we need to convert the text into numeric form.
3. Why is it Difficult?
If we ask any NLP practitioner or data scientist, the answer will be yes, it is somewhat difficult.
Now let us compare text feature extraction with feature extraction in other types of data.
In an image dataset, feature extraction is easy because images already come as numbers (pixel values).
With audio data, say emotion prediction from speech, the data comes as waveform signals, and features can be extracted over short time intervals.
But when we have a sentence and want to predict its sentiment, how do we represent it in numbers? The techniques below answer exactly that question.
4. What are the Techniques?
1. One Hot Encoding
2. Bag of Words (BOW)
3. n-grams
4. Tf-Idf
5. Custom features
6. Word2Vec (word embeddings)
Before diving into the techniques, here are some common terms:
Corpus (C) — all the text (the collection of every word) in the whole dataset.
Vocabulary (V) — the set of unique words in the corpus.
Document (D) — a single record or review in the dataset; a dataset has multiple documents.
Word (w) — an individual word used in a document.
One Hot Encoding
One-hot encoding converts each word of a document into a V-dimensional vector, where V is the vocabulary size. The technique is very intuitive: it is simple and you can easily code it yourself, which is essentially its only advantage.
Let’s say we have documents “We are learning Natural Language Processing”, “We are learning Data Science”, and “Natural Language Processing comes under Data Science”.
Corpus – We are learning Natural Language Processing. We are learning Data Science. Natural Language Processing comes under Data Science.
Vocabulary (unique words) – We, are, learning, Natural, Language, Processing, Data, Science, comes, under (V = 10)
Document 1 – We are learning Natural Language Processing
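As a minimal sketch of the idea (the plain Python list vocabulary and the whitespace split are assumptions made just for illustration), Document 1 can be one-hot encoded like this:

```python
# Vocabulary from the example above (V = 10)
vocabulary = ["We", "are", "learning", "Natural", "Language",
              "Processing", "Data", "Science", "comes", "under"]

def one_hot(word, vocab):
    """Return a V-dimensional vector with a 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1  # raises ValueError for OOV words
    return vector

document1 = "We are learning Natural Language Processing".split()
for word in document1:
    print(f"{word:12} -> {one_hot(word, vocabulary)}")
```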
Advantages:
1. It is intuitive.
2. Easy to implement.
One-hot encoding is rarely used in industry because it has several flaws:
1. It creates sparsity.
2. The size of each document after one-hot encoding may differ, and machine learning models need fixed-length inputs.
3. The out-of-vocabulary (OOV) problem: at prediction time, a new word may appear that is not in the vocabulary.
4. It does not capture semantic meaning.
Bag of Words (BOW)
Bag of Words is one of the most used text vectorization techniques. A bag-of-words model represents a piece of text by the occurrence of words within the document, and it is especially common in text classification tasks.
We can directly use the CountVectorizer class from scikit-learn.
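Here is a short sketch on the three example documents from above (assuming scikit-learn >= 1.0 for get_feature_names_out):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "We are learning Natural Language Processing",
    "We are learning Data Science",
    "Natural Language Processing comes under Data Science",
]

cv = CountVectorizer()
bow = cv.fit_transform(corpus)       # sparse document-term matrix

print(cv.get_feature_names_out())    # the learned vocabulary (lowercased)
print(bow.toarray())                 # one fixed-length count vector per document
```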
Advantages:
1. Simple and intuitive.
2. Every document is mapped to a vector of the same fixed size.
3. A new word at prediction time does not cause an error; the vectorizer simply ignores it.
Disadvantages:
1. BOW also creates sparsity.
2. OOV: new words are ignored, so their information is lost.
3. It does not take word order into account.
N-grams
A bag-of-n-grams model represents a text document as an unordered collection of its n-grams.
bi-gram — uses pairs of consecutive words from the document
tri-gram — uses triples of consecutive words from the document
n-gram — uses sequences of n consecutive words from the document
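A minimal sketch: the same CountVectorizer can produce bi-grams via its ngram_range parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "We are learning Natural Language Processing",
    "We are learning Data Science",
]

# ngram_range=(2, 2) keeps only bi-grams; (1, 2) would keep uni- and bi-grams
cv = CountVectorizer(ngram_range=(2, 2))
ngrams = cv.fit_transform(corpus)

print(cv.get_feature_names_out())   # e.g. ['are learning' 'data science' ...]
print(ngrams.toarray())
```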
Advantages:
1. Simple and easy to implement.
2. Able to capture some word order and local context of the sentence.
Disadvantages:
1. As we move from unigrams to higher-order n-grams, the dimensionality of the vectors grows rapidly and slows down the algorithm.
2. OOV: new words and n-grams at prediction time are still ignored.
TF-IDF
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document within a collection of documents. The weight of a term t in a document d is TF(t, d) × IDF(t), where TF(t, d) is the (normalized) frequency of t in d and IDF(t) = log(N / df(t)), with N the number of documents and df(t) the number of documents containing t.
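A short sketch with scikit-learn's TfidfVectorizer (note that scikit-learn uses a smoothed variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "We are learning Natural Language Processing",
    "We are learning Data Science",
    "Natural Language Processing comes under Data Science",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))   # higher weight = rarer, more informative word
```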
Q. Why do we take a log to calculate IDF?
For a very rare word, the raw IDF value N / df(t) is very large. When we then compute TF × IDF, the IDF value would dominate the product, because the TF value lies between 0 and 1. Taking the log compresses the IDF value to a comparable scale.
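A tiny worked example (the numbers are made up purely for illustration):

```python
import math

N = 10_000          # total documents in the corpus
df_rare = 2         # a rare word that appears in only 2 documents

idf_no_log = N / df_rare           # 5000.0 -- dwarfs a TF value in [0, 1]
idf_log = math.log(N / df_rare)    # ~8.52  -- on a scale comparable to TF

print(idf_no_log, round(idf_log, 2))
```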
Advantages:
1. This technique is widely used in information retrieval, for example in search engines.
Disadvantages:
1. Sparsity.
2. On a large dataset the dimensionality increases, slowing down algorithms.
3. OOV: new words are ignored.
4. It does not capture semantic meaning.
Custom Features
We can also create new, custom features using domain knowledge. Some examples (a small sketch follows the list):
1. Number of positive words in the document.
2. Number of negative words in the document.
3. Ratio of positive reviews to negative reviews.
4. Word count.
5. Character count.
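A minimal sketch with pandas; the DataFrame, its review column, and the NEGATIVE_WORDS list are all hypothetical, chosen just for illustration:

```python
import pandas as pd

# Hypothetical reviews and negative-word list, for illustration only
df = pd.DataFrame({"review": ["good movie", "not good, very boring movie"]})
NEGATIVE_WORDS = {"not", "boring", "bad"}

df["word_count"] = df["review"].str.split().str.len()
df["char_count"] = df["review"].str.len()
df["negative_words"] = df["review"].apply(
    lambda text: sum(w.strip(",.") in NEGATIVE_WORDS for w in text.lower().split())
)
print(df)
```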
Word Embeddings
Wikipedia says “In natural language processing, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.”
Let's consider an example: boy-man versus boy-table. Can you tell which pair contains words that are more similar to each other?
For a human, it is easy to understand the associations between words in a language. We know that boy and man are closer in meaning than boy and table, but what if we want machines to understand this kind of relationship automatically as well? That is where word embeddings come into the picture.
Broadly, there are two kinds of word embeddings:
1. Frequency-based – built from word counts, like the techniques above.
2. Prediction-based – learned by a model that predicts a word from its context, like Word2Vec.
Word2Vec
Word2Vec is somewhat different from the techniques we discussed earlier because it is a neural-network-based (prediction-based) technique.
Word2Vec is a word embedding technique that converts a given word into a vector, i.e. a collection of numbers.
If we already have the other techniques, why do we need Word2Vec? Because it produces dense, low-dimensional vectors that capture semantic meaning: words with similar meanings end up close together in the vector space, which none of the sparse techniques above can do.
There are two approaches to using Word2Vec:
1. Use a pre-trained model.
2. Train your own model on your corpus (see the sketch below).
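A minimal self-training sketch with gensim (assuming gensim 4.x; the toy corpus is far too small for useful vectors and is only for illustration):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus -- a real model needs millions of sentences
sentences = [
    ["we", "are", "learning", "natural", "language", "processing"],
    ["we", "are", "learning", "data", "science"],
    ["natural", "language", "processing", "comes", "under", "data", "science"],
]

# Self-trained model: each word becomes a dense 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

print(model.wv["language"][:5])            # first 5 dimensions of the vector
print(model.wv.most_similar("language"))   # nearest words in the vector space
```

For the pre-trained route, gensim's downloader can fetch published vectors, e.g. `import gensim.downloader as api; wv = api.load("word2vec-google-news-300")`.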
In this article, we learned about different feature extraction techniques: one-hot encoding, Bag of Words, n-grams, TF-IDF, custom features, and Word2Vec, along with the advantages and disadvantages of each. I hope you liked the article.