30 Questions to test a Data Scientist on Natural Language Processing [Solution: Skilltest – NLP]

Shivam Bansal Last Updated : 17 Jan, 2025

Humans are social beings, and language is how we connect with others. But what if machines could understand and respond to our language? Natural Language Processing (NLP) helps machines learn to understand and interpret human language.

We recently launched an NLP skill test, and 817 participants took it to evaluate their knowledge of NLP. If you missed it, don’t worry! We’ve shared the questions and solutions to help you learn. These questions for testing a data scientist’s skills are valuable whether or not you’ve taken an NLP course.

Skill Test Questions and Answers for Data Scientist

1) Which of the following techniques can be used for the purpose of keyword normalization, the process of converting a keyword into its base form?

1. Lemmatization
2. Levenshtein
3. Stemming
4. Soundex

A) 1 and 2

B) 2 and 4

C) 1 and 3

D) 1, 2 and 3

E) 2, 3 and 4

F) 1, 2, 3 and 4

Solution: (C)

Lemmatization and stemming are the techniques of keyword normalization, while Levenshtein and Soundex are techniques of string matching.
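
To make the distinction concrete, here is a minimal sketch of what normalization does. The suffix list and the lookup table are hypothetical toy data, and real stemmers and lemmatizers use far richer rules, so treat this as an illustration rather than a reference implementation.

```python
# Toy illustration only: production stemmers and lemmatizers are much more elaborate.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list
LEMMA_LOOKUP = {"better": "good", "ran": "run", "studies": "study"}  # hypothetical dictionary


def naive_stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word: str) -> str:
    """Map a word to its dictionary form when the lookup table knows it."""
    word = word.lower()
    return LEMMA_LOOKUP.get(word, word)


for w in ["playing", "played", "studies", "better"]:
    print(w, "->", naive_stem(w), "|", naive_lemmatize(w))
```

Stemming chops characters and may produce non-words, while lemmatization maps to a dictionary form; both reduce several surface forms to one base form.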

2) N-grams are defined as the combination of N keywords together. How many bi-grams can be generated from a given sentence:

“Analytics Vidhya is a great source to learn data science”

A) 7
B) 8
C) 9
D) 10
E) 11

Solution: (C)

Bigrams: Analytics Vidhya, Vidhya is, is a, a great, great source, source to, to learn, learn data, data science
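
The count can be checked with a few lines of Python. This is a small sketch that assumes whitespace tokenization; the helper name `ngrams` is ours.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Analytics Vidhya is a great source to learn data science"
bigrams = ngrams(sentence.split(), 2)
print(len(bigrams))  # a sentence of 10 tokens yields 10 - 2 + 1 bigrams
for pair in bigrams:
    print(" ".join(pair))
```

In general, a sequence of T tokens yields T - N + 1 n-grams of size N.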

3) How many trigram phrases can be generated from the following sentence, after performing the following text cleaning steps:

1. Stopword removal
2. Replacing punctuation with a single space

A) 3
B) 4
C) 5
D) 6
E) 7

Solution: (C)

After performing stopword removal and punctuation replacement the text becomes: “Analytics vidhya great source learn data science”

Trigrams – Analytics vidhya great, vidhya great source, great source learn, source learn data, learn data science
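
The cleaning steps and the trigram count can be sketched as follows. The raw input string and the stopword list below are assumptions made for illustration; only the cleaned text is taken from the solution above.

```python
import re

STOPWORDS = {"is", "a", "to"}  # assumed stopword list, just enough for this sentence


def clean(text):
    """Replace each punctuation character with a space, then drop stopwords."""
    no_punct = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return [tok for tok in no_punct.split() if tok.lower() not in STOPWORDS]


def trigrams(tokens):
    """Join every run of three consecutive tokens into a phrase."""
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


raw = "#Analytics-vidhya is a great source to learn @data_science."  # assumed raw input
tokens = clean(raw)
print(tokens)
print(trigrams(tokens))
```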

4) Which of the following regular expression can be used to identify date(s) present in the text object:

“The next meetup on data science will be held on 2017-09-21, previously it happened on 31/03, 2016”

A) \d{4}-\d{2}-\d{2}
B) (19|20)\d{2}-(0[1-9]|1[0-2])-[0-2][1-9]
C) (19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])
D) None of the above

Solution: (D)

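None of these expressions identifies every date in this text object. Each pattern can match the hyphenated date 2017-09-21, but none of them matches the second date, 31/03, 2016, which uses a slash, a comma, and a day-first order.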
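
A quick check with Python's `re` module shows what each candidate pattern actually matches in the sentence. Each one finds only the hyphenated date and none captures the second date, so no single option identifies both.

```python
import re

text = ("The next meetup on data science will be held on 2017-09-21, "
        "previously it happened on 31/03, 2016")

patterns = {
    "A": r"\d{4}-\d{2}-\d{2}",
    "B": r"(19|20)\d{2}-(0[1-9]|1[0-2])-[0-2][1-9]",
    "C": r"(19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])",
}

# finditer with group(0) returns whole matches even when the pattern contains groups
matches = {name: [m.group(0) for m in re.finditer(pattern, text)]
           for name, pattern in patterns.items()}

for name, found in matches.items():
    print(name, found)
```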

Question Context 5-6:

You have collected a dataset of about 10,000 rows of tweet text and no other information. You want to create a tweet classification model that categorizes each of the tweets into three buckets – positive, negative and neutral.

5) Which of the following models can perform tweet classification with regards to context mentioned above?

A) Naive Bayes
B) SVM
C) None of the above

Solution: (C)

Since you are given only the tweet text and no other information, there is no target variable present, so a supervised learning model cannot be trained. Both SVM and Naive Bayes are supervised learning techniques.

6) You have created a document-term matrix of the data, treating every tweet as one document. Which of the following is correct with regard to the document-term matrix?

1. Removal of stopwords from the data will affect the dimensionality of data
2. Normalization of words in the data will reduce the dimensionality of data
3. Converting all the words in lowercase will not affect the dimensionality of the data

A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3

Solution: (D)

Statements 1 and 2 are correct: stopword removal decreases the number of features in the matrix, and normalization of words reduces redundant features. Statement 3 is incorrect because converting all words to lowercase also decreases the dimensionality.
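
The effect of each step on the number of columns can be seen on a tiny corpus. The tweets, the stopword list, and the crude plural stripping below are all hypothetical stand-ins chosen for illustration, not a recommended preprocessing pipeline.

```python
import re

tweets = [  # hypothetical toy corpus
    "The movie was GREAT",
    "the Movies were great",
    "A great movie is playing",
]
STOPWORDS = {"the", "a", "is", "was", "were"}  # assumed stopword list


def vocabulary(docs, lowercase=False, drop_stopwords=False, normalize=False):
    """Return the distinct terms, i.e. the columns of the document-term matrix."""
    vocab = set()
    for doc in docs:
        for tok in re.findall(r"[A-Za-z]+", doc):
            if lowercase:
                tok = tok.lower()
            if drop_stopwords and tok.lower() in STOPWORDS:
                continue
            if normalize and tok.lower().endswith("s") and len(tok) > 3:
                tok = tok[:-1]  # crude plural stripping as a stand-in for stemming
            vocab.add(tok)
    return vocab


sizes = [
    len(vocabulary(tweets)),
    len(vocabulary(tweets, lowercase=True)),
    len(vocabulary(tweets, lowercase=True, drop_stopwords=True)),
    len(vocabulary(tweets, lowercase=True, drop_stopwords=True, normalize=True)),
]
print(sizes)  # each step shrinks the vocabulary on this corpus
```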

7) Which of the following features can be used for accuracy improvement of a classification model?

A) Frequency count of terms
B) Vector Notation of sentence
C) Part of Speech Tag
D) Dependency Grammar
E) All of these

Solution: (E)

All of the techniques can be used for the purpose of engineering features in a model.

8) What percentage of the total statements are correct with regards to Topic Modeling?

1. It is a supervised learning technique
2. LDA (Linear Discriminant Analysis) can be used to perform topic modeling
3. Selection of number of topics in a model does not depend on the size of data
4. Number of topic terms are directly proportional to size of the data

A) 0
B) 25
C) 50
D) 75
E) 100

Solution: (A)

Topic modeling is an unsupervised learning technique, and LDA here stands for Latent Dirichlet Allocation, not Linear Discriminant Analysis. The selection of the number of topics does depend on the size of the data, while the number of topic terms is not directly proportional to the size of the data. Hence none of the statements is correct.

9) In the Latent Dirichlet Allocation model for text classification purposes, what do the alpha and beta hyperparameters represent?

A) Alpha: number of topics within documents, beta: number of terms within topics
B) Alpha: density of terms generated within topics, beta: density of topics generated within terms
C) Alpha: number of topics within documents, beta: number of terms within topics
D) Alpha: density of topics generated within documents, beta: density of terms generated within topics

Solution: (D)

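Option D is correct: alpha controls the density of topics within each document, and beta controls the density of terms within each topic. Higher values spread the probability mass over more topics (or terms), while lower values concentrate it on a few.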
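
The "density" interpretation can be illustrated by sampling topic mixtures from a symmetric Dirichlet prior. This sketch uses only the standard library (normalized Gamma draws); the helper names, the number of topics, and the alpha values are assumptions chosen for the example, and no actual LDA model is trained here.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(concentration, size=5, trials=2000, seed=0):
    """Average weight of the dominant topic across many sampled documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(concentration, size, rng)) for _ in range(trials))
    return sum(shares) / trials


# A low alpha concentrates each document on few topics; a high alpha spreads it out.
print("alpha = 0.1 :", round(mean_largest_share(0.1), 2))
print("alpha = 10  :", round(mean_largest_share(10.0), 2))
```

The same reasoning applies to beta, with terms within a topic taking the place of topics within a document.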

10) Solve the equation according to the sentence “I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon”.

A = (# of words with Noun as the part of speech tag)
B = (# of words with Verb as the part of speech tag)
C = (# of words with frequency count greater than one)

What are the correct values of A, B, and C?

A) 5, 5, 2
B) 5, 5, 0
C) 7, 5, 1
D) 7, 4, 2
E) 6, 4, 3

Solution: (D)

Nouns: I, New, Delhi, Analytics, Vidhya, Delhi, Hackathon (7)

Verbs: am, planning, visit, attend (4)

Words with frequency counts > 1: to, Delhi (2)

Hence option D is correct.
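
The three counts can be reproduced with a short script. Note that the noun and verb sets below are copied from the solution above rather than produced by a part-of-speech tagger, so this only verifies the arithmetic.

```python
from collections import Counter

sentence = "I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon"
tokens = sentence.split()

# Tags copied from the solution above, not generated by a real POS tagger.
NOUNS = {"I", "New", "Delhi", "Analytics", "Vidhya", "Hackathon"}
VERBS = {"am", "planning", "visit", "attend"}

a = sum(tok in NOUNS for tok in tokens)  # repeated words count each time they occur
b = sum(tok in VERBS for tok in tokens)
c = sum(1 for count in Counter(tokens).values() if count > 1)

print(a, b, c)
```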

11) What is the correct value for the product of TF (term frequency) and IDF (inverse-document-frequency), if the term “data” appears in approximately one-third of the total documents?

A) KT * Log(3)
B) K * Log(3) / T
C) T * Log(3) / K
D) Log(3) / KT

Solution: (B)

The formula for TF is K/T, i.e. the term appears K times in a document of T terms.

The formula for IDF is log(total docs / number of docs containing “data”)

= log(1 / (⅓))

= log(3)

Hence the correct choice is K * Log(3) / T.
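
The same derivation can be checked numerically. The values of K, T, and the document counts below are assumed example numbers; any corpus in which one-third of the documents contain the term gives the same IDF.

```python
import math


def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """TF-IDF using the plain definitions from the solution above."""
    tf = term_count / doc_length                  # K / T
    idf = math.log(total_docs / docs_with_term)   # log(N / n)
    return tf * idf


K, T = 4, 200  # assumed example values
value = tf_idf(K, T, total_docs=900, docs_with_term=300)
print(value, K * math.log(3) / T)
```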

12) Which of the following documents contain the same number of terms, where that number is not equal to the least number of terms in any document in the entire corpus?

A) d1 and d4
B) d6 and d7
C) d2 and d4
D) d5 and d6

Solution: (C)

Both documents d2 and d4 contain 4 terms, which is more than the least number of terms in any document (3).

13) Which are the most common and the rarest term of the corpus?

A) t4, t6
B) t3, t5
C) t5, t1
D) t5, t6

Solution: (D)

t5 is the most common term, appearing in 5 out of 7 documents; t6 is the rarest term, appearing only in d3 and d4.

14) What is the term frequency of a term which is used a maximum number of times in that document?

A) t6 – 2/5
B) t3 – 3/6
C) t4 – 2/6
D) t1 – 2/6

Solution: (B)

t3 has the highest count (3), so the term frequency for t3 is 3/6.

15) Which of the following techniques is not a part of flexible text matching?

A) Soundex
B) Metaphone
C) Edit Distance
D) Keyword Hashing

Solution: (D)

Except for Keyword Hashing, all of the others are techniques used in flexible string matching.
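
As an example of flexible matching, here is a minimal sketch of Soundex, a phonetic code that maps similar-sounding names to the same key. It follows the commonly described American Soundex rules (keep the first letter, encode consonants as digits, collapse repeats, treat h and w as transparent, pad to four characters); treat it as an illustration rather than a reference implementation.

```python
def soundex(word: str) -> str:
    """Return a four-character phonetic code for an English word."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    previous = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")  # vowels map to "" and reset the previous code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


for name in ["Robert", "Rupert", "Smith", "Smyth"]:
    print(name, soundex(name))
```

Two spellings that receive the same code are treated as a phonetic match even though the strings differ.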

16) True or False: Word2Vec model is a machine learning model used to create vector notations of text objects. Word2vec contains multiple deep neural networks

A) TRUE
B) FALSE

Solution: (B)

Word2vec also contains a preprocessing model, which is not a deep neural network.

17) Which of the following statement is(are) true for Word2Vec model?

A) The architecture of word2vec consists of only two layers – continuous bag of words and skip-gram model
B) Continuous bag of word (CBOW) is a Recurrent Neural Network model
C) Both CBOW and Skip-gram are shallow neural network models
D) All of the above

Solution: (C)

Word2vec contains the continuous bag of words and skip-gram models, which are shallow neural networks.
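
The difference between the two architectures is easiest to see in the training examples they consume. The sketch below only generates those examples from a sliding window; it does not train a network, and the function names and window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context words, center) pairs: CBOW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tuple(tokens[j] for j in range(lo, hi) if j != i)
        pairs.append((context, center))
    return pairs


tokens = "analytics vidhya is great".split()
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```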

18) With respect to this context-free dependency graph, how many sub-trees exist in the sentence?

A) 3
B) 4
C) 5
D) 6

Solution: (D)

Subtrees in the dependency graph can be viewed as nodes having an outward link, for example:

Media, networking, play, role, billions, and lives are the roots of subtrees

19) What is the right order for the components of a text classification model?

1. Text cleaning
2. Text annotation
3. Gradient descent
4. Model tuning
5. Text to predictors

A) 12345
B) 13425
C) 12534
D) 13452

Solution: (C)

A proper text classification model consists of cleaning the text to remove noise, annotation to create more features, converting text-based features into predictors, learning a model using gradient descent, and finally tuning the model.

20) Polysemy is defined as the coexistence of multiple meanings for a word or phrase in a text object. Which of the following models is likely the best choice to correct this problem?

A) Random Forest Classifier
B) Convolutional Neural Networks
C) Gradient Boosting
D) All of these

Solution: (B)

CNNs are a popular choice for text classification problems because they take the left and right contexts of words into consideration as features, which can help solve the problem of polysemy.

21) Which of the following models can be used for the purpose of document similarity?

A) Training a word2vec model on the corpus that learns the context present in the document
B) Training a bag of words model that learns occurrence of words in the document
C) Creating a document-term matrix and using cosine similarity for each document
D) All of the above

Solution: (D)

A word2vec model can be used for measuring document similarity based on context. Bag of words and a document-term matrix can be used for measuring similarity based on terms.
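
Term-based similarity can be sketched with plain term counts and cosine similarity. This minimal example assumes whitespace tokenization and raw counts (no TF-IDF weighting); the helper names are ours.

```python
import math
from collections import Counter


def term_vector(document):
    """One row of a document-term matrix, stored sparsely as term counts."""
    return Counter(document.lower().split())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two term-count vectors."""
    va, vb = term_vector(doc_a), term_vector(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("data science is fun", "data science is hard"))
print(cosine_similarity("data science", "machine learning"))
```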

22) What are the possible features of a text corpus?

1. Count of word in a document
2. Boolean feature – presence of word in a document
3. Vector notation of word
4. Part of Speech Tag
5. Basic Dependency Grammar
6. Entire document as a feature

A) 1
B) 12
C) 123
D) 1234
E) 12345
F) 123456

Solution: (E)

Except for the entire document as a feature, all of the others can be used as features of a text classification learning model.

23) While creating a machine learning model on text data, you created a document term matrix of the input data of 100K documents. Which of the following remedies can be used to reduce the dimensions of data –

1. Latent Dirichlet Allocation
2. Latent Semantic Indexing
3. Keyword Normalization

A) only 1
B) 2, 3
C) 1, 3
D) 1, 2, 3

Solution: (D)

All of the techniques can be used to reduce the dimensions of the data.

24) Google Search’s feature – “Did you mean”, is a mixture of different techniques. Which of the following techniques are likely to be ingredients?

1. Collaborative Filtering model to detect similar user behaviors (queries)
2. Model that checks for Levenshtein distance among the dictionary terms
3. Translation of sentences into multiple languages

A) 1
B) 2
C) 1, 2
D) 1, 2, 3

Solution: (C)

Collaborative filtering can be used to find the query patterns used by people, and Levenshtein distance is used to measure the distance between the query and dictionary terms.
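
The Levenshtein ingredient can be sketched as follows: compute the edit distance between the query and each dictionary term and suggest the closest one. The word list and the `did_you_mean` helper are hypothetical, and this is not a description of how Google Search actually implements the feature.

```python
def levenshtein(a, b):
    """Edit distance with insertions, deletions and substitutions, each costing 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


DICTIONARY = ["language", "learning", "linguistics", "lemmatization"]  # hypothetical word list


def did_you_mean(query, dictionary=DICTIONARY):
    """Suggest the dictionary term with the smallest edit distance to the query."""
    return min(dictionary, key=lambda word: levenshtein(query, word))


print(did_you_mean("langauge"))
```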

25) While working with text data obtained from news sentences, which are structured in nature, which of the grammar-based text parsing techniques can be used for noun phrase detection, verb phrase detection, subject detection and object detection.

A) Part of speech tagging
B) Dependency Parsing and Constituency Parsing
C) Skip Gram and N-Gram extraction
D) Continuous Bag of Words

Solution: (B)

Dependency and constituency parsing extract these relations from the text.

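26) Social Media platforms are the most intuitive form of text data. You are given a corpus of complete social media data of tweets. How can you create a model that suggests the hashtags?

A) Perform topic modeling to obtain the most significant words of the corpus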
B) Train a Bag of Ngrams model to capture top n-grams – words and their combinations
C) Train a word2vector model to learn repeating contexts in the sentences
D) All of these

Solution: (D)

All of the techniques can be used to extract most significant terms of a corpus.

27) While working with context extraction from a text data, you encountered two different sentences: The tank is full of soldiers. The tank is full of nitrogen. Which of the following measures can be used to remove the problem of word sense disambiguation in the sentences?

A) Compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood
B) Co-reference resolution, in which one resolves the meaning of an ambiguous word with the proper noun present in the previous sentence
C) Use dependency parsing of sentence to understand the meanings

Solution: (A)

Option A is the Lesk algorithm, which is used for word sense disambiguation; the other options cannot be used for it.
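
A simplified version of the Lesk idea can be sketched in a few lines: pick the sense whose definition shares the most words with the sentence. The glosses and the stopword list below are hypothetical, written for this example rather than taken from a real dictionary.

```python
STOPWORDS = {"the", "is", "of", "a", "an", "for", "or", "and", "full"}  # assumed list

# Hypothetical glosses written for this example, not taken from a real dictionary.
SENSES = {
    "tank": {
        "military vehicle": "an armored combat vehicle that carries soldiers and heavy guns",
        "storage container": "a large container for holding liquid or gas such as water or nitrogen",
    }
}


def content_words(text):
    """Lowercased words with trailing punctuation and stopwords removed."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS


def simplified_lesk(word, sentence):
    """Return the sense whose gloss overlaps most with the words around `word`."""
    context = content_words(sentence) - {word}
    glosses = SENSES[word]
    return max(glosses, key=lambda sense: len(context & content_words(glosses[sense])))


print(simplified_lesk("tank", "The tank is full of soldiers."))
print(simplified_lesk("tank", "The tank is full of nitrogen."))
```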

28) Collaborative Filtering and Content Based Models are the two popular recommendation engines, what role does NLP play in building such algorithms.

A) Feature Extraction from text
B) Measuring Feature Similarity
C) Engineering Features for vector space learning model
D) All of these

Solution: (D)

NLP can be used anywhere text data is involved – feature extraction, measuring feature similarity, and creating vector features of the text.

29) Retrieval based models and Generative models are the two popular techniques used for building chatbots. Which of the following is an example of retrieval model and generative model respectively.

A) Dictionary-based learning and word2vec model
B) Rule-based learning and Sequence to Sequence model
C) word2vec and Sentence to Vector model
D) Recurrent neural network and convolutional neural network

Solution: (B)

Option B best matches: rule-based learning is an example of a retrieval-based model, and a sequence to sequence model is an example of a generative model.

30) What is the major difference between CRF (Conditional Random Field) and HMM (Hidden Markov Model)?

A) CRF is Generative whereas HMM is Discriminative model
B) CRF is Discriminative whereas HMM is Generative model
C) Both CRF and HMM are Generative model
D) Both CRF and HMM are Discriminative model

Solution: (B)

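Option B is correct: a CRF is a discriminative model that learns the conditional probability of the labels given the observations, whereas an HMM is a generative model of the joint probability of observations and labels.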

End Notes

I tried my best to make the solutions as comprehensive as possible, but if you have any questions or doubts, please drop them in the comments below. I would also love to hear your feedback about the skill test, so feel free to share it in the comments. For the latest and upcoming skill tests, please refer to the DataHack platform of Analytics Vidhya.

Shivam Bansal

Shivam Bansal is a data scientist with exhaustive experience in Natural Language Processing and Machine Learning in several domains. He is passionate about learning and always looks forward to solving challenging analytical problems.
