Humans are social animals, and language is our primary tool to communicate with society. But what if machines could understand our language and then act accordingly? Natural Language Processing (NLP) is the science of teaching machines how to understand the language we humans speak and write.
We recently launched an NLP skill test on which a total of 817 people registered. The test was designed to assess your knowledge of Natural Language Processing. If you are one of those who missed out on this skill test, here are the questions and solutions. We encourage you to go through them irrespective of whether you have gone through any NLP program or not.
Here are the leaderboard rankings for all the participants.
Below is the score distribution; it will help you evaluate your performance.
You can access the scores here. More than 250 people participated in the skill test and the highest score obtained was 24.
Here are some resources to get in-depth knowledge of the subject.
Ultimate Guide to Understand & Implement Natural Language Processing (with codes in Python)
And if you are just getting started with Natural Language Processing, check out the most comprehensive programs on NLP:
1) Which of the following techniques can be used for keyword normalization, the process of converting a keyword into its base form?
A) 1 and 2
B) 2 and 4
C) 1 and 3
D) 1, 2 and 3
E) 2, 3 and 4
F) 1, 2, 3 and 4
Solution: (C)
Lemmatization and stemming are techniques of keyword normalization, while Levenshtein and Soundex are techniques of string matching.
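As a quick illustration, here is a toy suffix-stripping stemmer. The rules below are simplified assumptions for demonstration only – in practice you would use a proper implementation such as NLTK's PorterStemmer or a WordNet lemmatizer.

```python
def naive_stem(word):
    """Strip a few common English suffixes to approximate a base form.

    The suffix list is a toy assumption, not Porter's algorithm.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays", "play"]])
# All four normalize to the base form "play"
```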
2) N-grams are defined as contiguous sequences of N words. How many bigrams can be generated from the following sentence:
“Analytics Vidhya is a great source to learn data science”
A) 7
B) 8
C) 9
D) 10
E) 11
Solution: (C)
Bigrams: Analytics Vidhya, Vidhya is, is a, a great, great source, source to, to learn, learn data, data science
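The count is easy to verify in code. This sketch builds n-grams with a simple sliding window over the tokens:

```python
def ngrams(text, n):
    """Return all n-grams (tuples of adjacent tokens) in a sentence."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Analytics Vidhya is a great source to learn data science"
bigrams = ngrams(sentence, 2)
print(len(bigrams))  # 10 tokens yield 9 bigrams
```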
3) How many trigram phrases can be generated from the following sentence, after performing the following text cleaning steps:
“#Analytics-vidhya is a great source to learn @data_science.”
A) 3
B) 4
C) 5
D) 6
E) 7
Solution: (C)
After performing stopword removal and punctuation replacement, the text becomes: “Analytics vidhya great source learn data science”
Trigrams – Analytics vidhya great, vidhya great source, great source learn, source learn data, learn data science
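You can verify the count with the same sliding-window idea, starting from the cleaned text:

```python
cleaned = "Analytics vidhya great source learn data science"
tokens = cleaned.split()
# A window of 3 adjacent tokens gives each trigram
trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
print(len(trigrams))  # 7 tokens yield 5 trigrams
```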
4) Which of the following regular expressions can be used to identify date(s) present in the text object:
Solution: (D)
None of these expressions would be able to identify the dates in this text object.
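The answer options are not reproduced here, but as an illustration, a pattern along the following lines (an assumption, not one of the quiz options) would catch common numeric date formats:

```python
import re

# Illustrative pattern (an assumption): matches dd/mm/yyyy, dd-mm-yyyy
# and yyyy-mm-dd style dates bounded by word boundaries.
date_pattern = r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b"

text = "The hackathon ran from 12/08/2017 to 2017-08-14."
print(re.findall(date_pattern, text))  # finds both date strings
```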
Question Context 5-6:
You have collected about 10,000 rows of tweet text and no other information. You want to create a tweet classification model that categorizes each tweet into one of three buckets – positive, negative, and neutral.
5) Which of the following models can perform tweet classification with regard to the context mentioned above?
Solution: (C)
Since you are given only the tweet text and no other information, there is no target variable present, so a supervised learning model cannot be trained. Both SVM and Naive Bayes are supervised learning techniques.
6) You have created a document-term matrix of the data, treating every tweet as one document. Which of the following is correct with regard to the document-term matrix?
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (D)
Statements 1 and 2 are correct: stopword removal will decrease the number of features in the matrix, normalization of words will reduce redundant features, and converting all words to lowercase will also decrease the dimensionality.
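A quick sketch of how these steps shrink the vocabulary (and hence the matrix dimensions); the toy corpus and stopword list are assumptions for illustration:

```python
# Toy corpus (an assumption for illustration)
docs = ["Data Science is FUN", "the data science", "Science of the data"]

raw = {w for d in docs for w in d.split()}              # case-sensitive vocabulary
lowered = {w.lower() for d in docs for w in d.split()}  # after lowercasing
no_stop = lowered - {"the", "is", "of"}                 # after stopword removal
print(len(raw), len(lowered), len(no_stop))  # vocabulary shrinks at each step
```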
7) Which of the following features can be used for accuracy improvement of a classification model?
All of the techniques can be used for engineering features in a model.
8) What percentage of the total statements are correct with regards to Topic Modeling?
A) 0
B) 25
C) 50
D) 75
E) 100
Solution: (A)
LDA is an unsupervised learning model; LDA stands for latent Dirichlet allocation, not linear discriminant analysis. The appropriate number of topics is directly proportional to the size of the data, while the number of topic terms is not. Hence none of the statements is correct.
9) In the Latent Dirichlet Allocation model for text classification purposes, what do the alpha and beta hyperparameters represent?
Option D is correct
10) Solve the equation according to the sentence “I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon”.
A) 5, 5, 2
B) 5, 5, 0
C) 7, 5, 1
D) 7, 4, 2
E) 6, 4, 3
Solution: (D)
Nouns: I, New, Delhi, Analytics, Vidhya, Delhi, Hackathon (7)
Verbs: am, planning, visit, attend (4)
Words with frequency counts > 1: to, Delhi (2)
Hence option D is correct.
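Counting the word frequencies is straightforward with Python's Counter (the part-of-speech counts would need a tagger such as nltk.pos_tag, which requires trained models, so only the frequency part is sketched here):

```python
from collections import Counter

sentence = "I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon"
counts = Counter(sentence.split())
# Words appearing more than once in the sentence
repeated = [w for w, c in counts.items() if c > 1]
print(repeated)  # ['to', 'Delhi']
```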
11) In a corpus of N documents, one document is randomly picked. The document contains a total of T terms and the term “data” appears K times.
What is the correct value for the product of TF (term frequency) and IDF (inverse-document-frequency), if the term “data” appears in approximately one-third of the total documents?
The formula for TF is K/T.
The formula for IDF is log(total documents / number of documents containing “data”):
IDF = log(N / (N/3)) = log(3)
Hence the correct choice is (K/T) · log(3), i.e., K log(3)/T.
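Plugging in some hypothetical numbers (K = 4 occurrences in a T = 100-term document, with “data” appearing in 10 of N = 30 documents – all values are assumptions for illustration) gives:

```python
import math

def tf_idf(k, t, n_docs, docs_with_term):
    """TF = k/t, IDF = log(N / n_containing); the product is the TF-IDF weight."""
    tf = k / t
    idf = math.log(n_docs / docs_with_term)
    return tf * idf

# "data" appears 4 times in a 100-term document, and in 10 of 30 documents
weight = tf_idf(4, 100, 30, 10)
print(weight)  # equals (4/100) * log(3)
```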
Question Context 12 to 14:
Refer to the following document-term matrix.
12) Which of the following documents contains the same number of terms, where that number is not equal to the least number of terms in any document of the entire corpus?
Both documents d2 and d4 contain 4 terms each, and neither contains the least number of terms in the corpus, which is 3.
13) Which are the most common and the rarest terms of the corpus?
T5 is the most common term, appearing in 5 out of 7 documents; T6 is the rarest, appearing only in d3 and d4.
14) What is the term frequency of the term which is used the maximum number of times in that document?
t3 is used the maximum number of times in the entire corpus (3 occurrences), so the term frequency for t3 is 3/6.
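The matrix from the question is not reproduced here, but building a document-term matrix and computing a per-document term frequency can be sketched with plain Python (the toy corpus is an assumption; in practice you would use something like scikit-learn's CountVectorizer):

```python
from collections import Counter

# Toy corpus (an assumption; the question's actual matrix is not shown here)
docs = ["data science is fun", "learn data science", "science of data"]
vocab = sorted({w for d in docs for w in d.split()})
# One row per document, one column per vocabulary term
dtm = [[Counter(d.split())[t] for t in vocab] for d in docs]

# Term frequency of the most-used term in a document = count / document length
counts = Counter(docs[0].split())
top_term, top_count = counts.most_common(1)[0]
tf_top = top_count / len(docs[0].split())
print(vocab)
print(dtm)
print(tf_top)
```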
15) Which of the following techniques is not a part of flexible text matching?
Except for keyword hashing, all the others are techniques used in flexible string matching.
16) True or False: The Word2Vec model is a machine learning model used to create vector representations of text objects. Word2vec contains multiple deep neural networks.
Word2vec also contains a preprocessing model, which is not a deep neural network.
17) Which of the following statements is (are) true for the Word2Vec model?
Word2vec contains the continuous bag-of-words (CBOW) and skip-gram models, which are deep neural networks.
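Before any network is trained, the skip-gram model turns the corpus into (target, context) training pairs. A minimal sketch of that step (the window size and sample sentence are arbitrary assumptions):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as in word2vec's skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window=1)
print(pairs)  # 6 pairs for a 4-token sentence with window 1
```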
18) With respect to this context-free dependency graph, how many subtrees exist in the sentence?
Subtrees in the dependency graph can be viewed as nodes having an outward link. For example: media, networking, play, role, billions, and lives are the roots of subtrees.
19) What is the right order of components for a text classification model?
A) 12345
B) 13425
C) 12534
D) 13452
Solution: (C)
A proper text classification pipeline contains: cleaning of text to remove noise, annotation to create more features, converting text-based features into predictors, learning a model using gradient descent, and finally tuning the model.
20) Polysemy is defined as the coexistence of multiple meanings for a word or phrase in a text object. Which of the following models is likely the best choice to correct this problem?
CNNs are a popular choice for text classification problems because they take into consideration the left and right contexts of words as features, which can address the problem of polysemy.
21) Which of the following models can be used for the purpose of document similarity?
The word2vec model can be used for measuring document similarity based on context, while bag-of-words and document-term matrices can be used for measuring similarity based on terms.
22) What are the possible features of a text corpus?
A) 1
B) 12
C) 123
D) 1234
E) 12345
F) 123456
Solution: (E)
Except for the entire document as a feature, all the rest can be used as features of a text classification learning model.
23) While creating a machine learning model on text data, you created a document-term matrix of the input data of 100K documents. Which of the following remedies can be used to reduce the dimensions of the data?
A) only 1
B) 2, 3
C) 1, 3
D) 1, 2, 3
Solution: (D)
All of the techniques can be used to reduce the dimensions of the data.
24) Google Search’s feature – “Did you mean”, is a mixture of different techniques. Which of the following techniques are likely to be ingredients?
A) 1
B) 2
C) 1, 2
D) 1, 2, 3
Solution: (C)
Collaborative filtering can be used to detect the patterns in what people search for, and Levenshtein distance is used to measure the distance between dictionary terms.
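Levenshtein distance itself is easy to implement with dynamic programming:

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(t) + 1))  # distance from empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```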
25) While working with text data obtained from news sentences, which are structured in nature, which of the grammar-based text parsing techniques can be used for noun phrase detection, verb phrase detection, subject detection, and object detection?
Dependency parsing and constituency parsing extract these relations from the text.
26) Social media platforms are the most intuitive source of text data. You are given a corpus of complete social media data of tweets. How can you create a model that suggests hashtags?
All of the techniques can be used to extract the most significant terms of a corpus.
27) While working on context extraction from text data, you encountered two different sentences: “The tank is full of soldiers.” and “The tank is full of nitrogen.” Which of the following measures can be used to disambiguate the word sense in these sentences?
Option 1 describes the Lesk algorithm, which is used for word sense disambiguation; the others cannot be used.
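A simplified Lesk algorithm picks the sense whose dictionary gloss shares the most words with the sentence. The toy glosses below are assumptions, not real dictionary definitions:

```python
def simplified_lesk(context, senses):
    """Pick the sense whose gloss overlaps most with the context words.

    `senses` maps a sense label to its gloss text.
    """
    ctx = set(context.lower().split())
    return max(senses, key=lambda s: len(ctx & set(senses[s].lower().split())))

# Toy glosses (assumptions for illustration)
senses = {
    "vehicle": "an armored military vehicle with guns and soldiers",
    "container": "a large container for storing liquid or gas such as nitrogen",
}
print(simplified_lesk("The tank is full of soldiers", senses))   # vehicle
print(simplified_lesk("The tank is full of nitrogen", senses))   # container
```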
28) Collaborative filtering and content-based models are the two popular types of recommendation engines. What role does NLP play in building such algorithms?
NLP can be used wherever text data is involved – extracting features, measuring feature similarity, and creating vector representations of text.
29) Retrieval-based models and generative models are the two popular techniques used for building chatbots. Which of the following is an example of a retrieval model and a generative model, respectively?
Choice 2 best explains the examples of retrieval-based and generative models.
30) What is the major difference between CRF (Conditional Random Field) and HMM (Hidden Markov Model)?
Option B is correct
I tried my best to make the solutions as comprehensive as possible, but if you have any questions or doubts, please drop them in the comments below. I would also love to hear your feedback about the skill test – feel free to share it in the comments. For the latest and upcoming skill tests, please refer to the DataHack platform of Analytics Vidhya.
If you want to learn more about Natural Language Processing and how it is implemented in Python, then check out our video course on NLP using Python.
Happy Learning!