Humans are social beings, and language is how we connect with others. But what if machines could understand and respond to our language? Natural Language Processing (NLP) helps machines learn to understand and interpret human language.
We recently launched an NLP skill test with 817 participants to evaluate their knowledge of NLP. If you missed it, don’t worry! We’ve shared the questions and solutions to help you learn. These questions test a data scientist’s NLP skills and are valuable whether or not you’ve taken an NLP course.
A) 1 and 2
B) 2 and 4
C) 1 and 3
D) 1, 2 and 3
E) 2, 3 and 4
F) 1, 2, 3 and 4
Solution: (C)
Lemmatization and stemming are keyword normalization techniques, while Levenshtein and Soundex are string-matching techniques.
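A minimal sketch of both normalization techniques, assuming NLTK and its WordNet data are installed (the example words are illustrative):

```python
# Keyword normalization: stemming chops suffixes, lemmatization maps to the dictionary form.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "cries"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
```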
“Analytics Vidhya is a great source to learn data science”
A) 7
B) 8
C) 9
D) 10
E) 11
Solution: (C)
Bigrams: Analytics Vidhya, Vidhya is, is a, a great, great source, source to, to learn, learn data, data science
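This count is easy to reproduce in code; the sketch below assumes NLTK is available:

```python
# Generate bigrams (n=2) from the whitespace-tokenized sentence.
from nltk import ngrams

tokens = "Analytics Vidhya is a great source to learn data science".split()
bigrams = list(ngrams(tokens, 2))
print(len(bigrams))  # 9
```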
A) 3
B) 4
C) 5
D) 6
E) 7
Solution: (C)
After performing stopword removal and punctuation replacement, the text becomes: “Analytics vidhya great source learn data science”
Trigrams: Analytics vidhya great, vidhya great source, great source learn, source learn data, learn data science
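The same result in code, assuming NLTK with its English stopword list (the exact stopword list used in the question may differ slightly):

```python
# Remove stopwords, then form trigrams (n=3) from the remaining tokens.
from nltk import ngrams
from nltk.corpus import stopwords

text = "Analytics vidhya is a great source to learn data science"
stops = set(stopwords.words("english"))
tokens = [w for w in text.split() if w.lower() not in stops]
print(list(ngrams(tokens, 3)))  # 5 trigrams
```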
“The next meetup on data science will be held on 2017-09-21; previously it happened on 31/03, 2016”
A) \d{4}-\d{2}-\d{2}
B) (19|20)\d{2}-(0[1-9]|1[0-2])-[0-2][1-9]
C) (19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])
D) None of the above
Solution: (D)
None of these expressions can identify both dates in this text: the second date, written as “31/03, 2016”, does not follow the YYYY-MM-DD pattern that the expressions above target.
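A quick check (sketch) of the strongest candidate, option C, shows that it captures only the ISO-style date and never the “31/03, 2016” one:

```python
import re

text = ("The next meetup on data science will be held on 2017-09-21, "
        "previously it happened on 31/03, 2016")
pattern = r"(19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])"
for m in re.finditer(pattern, text):
    print(m.group(0))  # prints only 2017-09-21
```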
Question Context 5-6:
You have collected about 10,000 rows of tweet text and no other information. You want to create a tweet classification model that categorizes each tweet into three buckets – positive, negative, and neutral.
A) Naive Bayes
B) SVM
C) None of the above
Solution: (C)
Since you are given only the tweet text and no other information, there is no target variable present. Without labels, one cannot train a supervised learning model, and both SVM and Naive Bayes are supervised learning techniques.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (D)
Statements 1 and 2 are correct: stopword removal decreases the number of features in the matrix, and normalization of words reduces redundant features. Converting all words to lowercase also decreases the dimensionality.
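An illustrative sketch (with an assumed two-document corpus) of how lowercasing and stopword removal shrink a term-document matrix in scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Analytics Vidhya is a great source to learn data science",
        "Data science is a great career choice"]

raw = CountVectorizer(lowercase=False).fit(docs)
clean = CountVectorizer(lowercase=True, stop_words="english").fit(docs)

print(len(raw.vocabulary_), "features before preprocessing")
print(len(clean.vocabulary_), "features after lowercasing and stopword removal")
```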
A) Frequency count of terms
B) Vector Notation of sentence
C) Part of Speech Tag
D) Dependency Grammar
E) All of these
Solution: (E)
All of the techniques can be used for the purpose of engineering features in a model.
A) 0
B) 25
C) 50
D) 75
E) 100
Solution: (A)
LDA here is latent Dirichlet allocation, not linear discriminant analysis, and it is an unsupervised learning model. The selection of the number of topics is directly proportional to the size of the data, while the number of topic terms is not. Hence none of the statements is correct.
A) Alpha: number of topics within documents, beta: number of terms within topics
B) Alpha: density of terms generated within topics, beta: density of topics generated within terms
C) Alpha: number of topics within documents, beta: number of terms within topics
D) Alpha: density of topics generated within documents, beta: density of terms generated within topics
Solution: (D)
In LDA, alpha controls the density of topics generated within documents and beta controls the density of terms generated within topics, so option D is correct.
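A sketch with scikit-learn's LDA, where doc_topic_prior corresponds to alpha and topic_word_prior to beta (the corpus and parameter values are assumptions for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data science is fun",
        "topic models find themes",
        "science uses data and models"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,
                                doc_topic_prior=0.1,    # alpha: topic density per document
                                topic_word_prior=0.01,  # beta: term density per topic
                                random_state=0)
doc_topics = lda.fit_transform(X)  # unsupervised: no labels needed
print(doc_topics.shape)            # (3, 2)
```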
A = (# of words with Noun as the part of speech tag)
B = (# of words with Verb as the part of speech tag)
C = (# of words with frequency count greater than one)
What are the correct values of A, B, and C?
A) 5, 5, 2
B) 5, 5, 0
C) 7, 5, 1
D) 7, 4, 2
E) 6, 4, 3
Solution: (D)
Nouns: I, New, Delhi, Analytics, Vidhya, Delhi, Hackathon (7)
Verbs: am, planning, visit, attend (4)
Words with frequency counts > 1: to, Delhi (2)
Hence option D is correct.
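If you want to verify the counts programmatically, here is a sketch with NLTK's tagger on the sentence reconstructed from the words listed above; tag sets differ between taggers (NLTK, for instance, tags “I” as a pronoun rather than a noun), so treat the output as illustrative:

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

sentence = "I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))
nouns = [w for w, t in tags if t.startswith("NN")]
verbs = [w for w, t in tags if t.startswith("VB")]
print("Nouns:", nouns)
print("Verbs:", verbs)
```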
A) KT * Log(3)
B) K * Log(3) / T
C) T * Log(3) / K
D) Log(3) / KT
Solution: (B)
The formula for TF is K/T.
The formula for IDF is log(total number of documents / number of documents containing “data”) = log(N / (N/3)) = log(3).
Hence TF-IDF = K * log(3) / T, which matches choice B.
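A worked sketch of the same arithmetic with symbolic K, T, and N (the numbers are placeholders):

```python
import math

K, T, N = 4, 100, 3           # assumed counts for illustration
tf = K / T                    # term frequency of "data" in the document
idf = math.log(N / (N / 3))   # "data" appears in one third of the N documents
print(tf * idf, (K / T) * math.log(3))  # both equal K*log(3)/T
```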
A) d1 and d4
B) d6 and d7
C) d2 and d4
D) d5 and d6
Solution: (C)
Both documents d2 and d4 contain 4 terms, and neither contains the least number of terms in the corpus, which is 3.
A) t4, t6
B) t3, t5
C) t5, t1
D) t5, t6
Solution: (A)
t5 is the most common term, appearing in 5 out of 7 documents, while t6 is a rare term that appears only in d3 and d4.
A) t6 – 2/5
B) t3 – 3/6
C) t4 – 2/6
D) t1 – 2/6
Solution: (B)
t3 is used the maximum number of times in the entire corpus (3 occurrences), so its TF is 3/6.
A) Soundex
B) Metaphone
C) Edit Distance
D) Keyword Hashing
Solution: (D)
Except for keyword hashing, all the others are techniques used in flexible string matching.
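As a sketch of one flexible string-matching technique, here is Levenshtein edit distance implemented directly, so no extra library is needed:

```python
def edit_distance(a: str, b: str) -> int:
    # Dynamic programming over two rolling rows of the edit-distance table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("analytics", "analitycs"))  # small distance => likely the same word
```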
A) TRUE
B) FALSE
Solution: (B)
Word2vec also contains a preprocessing model, which is not a deep neural network.
A) The architecture of word2vec consists of only two layers – continuous bag of words and skip-gram model
B) Continuous bag of word (CBOW) is a Recurrent Neural Network model
C) Both CBOW and Skip-gram are shallow neural network models
D) All of the above
Solution: (C)
Word2vec contains the continuous bag of words (CBOW) and skip-gram models, both of which are shallow (not deep) neural networks.
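A sketch with gensim (parameter names assume gensim 4.x): sg=0 trains the CBOW architecture and sg=1 trains skip-gram, and both are shallow single-hidden-layer networks:

```python
from gensim.models import Word2Vec

sentences = [["analytics", "vidhya", "is", "a", "great", "source"],
             ["learn", "data", "science", "with", "analytics", "vidhya"]]

cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
print(cbow.wv["analytics"].shape)  # (50,)
```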
A) 3
B) 4
C) 5
D) 6
Solution: (D)
Subtrees in the dependency graph can be viewed as nodes having an outward link; for example, Media, networking, play, role, billions, and lives are the roots of subtrees, giving a total of 6.
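One way to inspect such subtrees is with spaCy; the sentence below is an assumption standing in for the original, and the exact count depends on the parser:

```python
import spacy  # requires the 'en_core_web_sm' model to be downloaded

nlp = spacy.load("en_core_web_sm")
doc = nlp("Media and networking play a major role in the lives of billions of people.")

# Tokens that have at least one child head a subtree of the dependency graph.
roots = [token.text for token in doc if list(token.children)]
print(roots)
```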
A) 12345
B) 13425
C) 12534
D) 13452
Solution: (C)
A proper text classification workflow consists of: cleaning the text to remove noise, annotating it to create more features, converting text-based features into predictors, learning a model using gradient descent, and finally tuning the model.
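A sketch of this ordering with scikit-learn (the toy data, and the loss name, which assumes a recent scikit-learn version, are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

texts = ["great product", "terrible service", "okay experience", "loved it"]
labels = [1, 0, 0, 1]  # assumed labels for illustration

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),  # clean + featurize
    ("clf", SGDClassifier(loss="log_loss", random_state=0)),           # gradient-descent learner
])
search = GridSearchCV(pipe, {"clf__alpha": [1e-4, 1e-3]}, cv=2)        # tuning step
search.fit(texts, labels)
```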
A) Random Forest Classifier
B) Convolutional Neural Networks
C) Gradient Boosting
D) All of these
Solution: (B)
CNNs are a popular choice for text classification problems because they take the left and right contexts of words into consideration as features, which can help address the problem of polysemy.
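A minimal Keras sketch (vocabulary size, sequence length, and filter settings are assumptions): a 1-D convolution slides over the word embeddings, so each filter sees a word together with its left and right neighbours:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100,)),                        # padded sequences of 100 word ids
    layers.Embedding(input_dim=10000, output_dim=64),  # learn 64-d word embeddings
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),  # 3-word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(3, activation="softmax"),             # e.g. positive / negative / neutral
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```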
A) Training a word 2 vector model on the corpus that learns context present in the document
B) Training a bag of words model that learns occurrence of words in the document
C) Creating a document-term matrix and using cosine similarity for each document
D) All of the above
Solution: (D)
A word2vec model can be used to measure document similarity based on context, while bag-of-words and document-term-matrix approaches measure similarity based on terms.
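A sketch of the document-term-matrix route, with an assumed corpus: TF-IDF vectors followed by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["data science with python",
        "python for data analysis",
        "cooking recipes for beginners"]

X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))  # 3x3 matrix of pairwise document similarities
```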
A) 1
B) 12
C) 123
D) 1234
E) 12345
F) 123456
Solution: (E)
Except for using the entire document itself as a feature, all the rest can be used as features of a text classification model.
A) only 1
B) 2, 3
C) 1, 3
D) 1, 2, 3
Solution: (D)
All of the techniques can be used to reduce the dimensions of the data.
A) 1
B) 2
C) 1, 2
D) 1, 2, 3
Solution: (C)
Collaborative filtering can be used to identify the patterns in what other people have searched, while Levenshtein distance is used to measure the distance among dictionary terms.
A) Part of speech tagging
B) Dependency Parsing and Constituency Parsing
C) Skip Gram and N-Gram extraction
D) Continuous Bag of Words
Solution: (B)
Dependency parsing and constituency parsing extract such relations from the text.
A) Perform Topic Models to obtain most significant words of the corpus
B) Train a Bag of Ngrams model to capture top n-grams – words and their combinations
C) Train a word2vector model to learn repeating contexts in the sentences
D) All of these
Solution: (D)
All of these techniques can be used to extract the most significant terms from a corpus.
A) Compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood
B) Co-reference resolution, in which one resolves the meaning of an ambiguous word using the proper noun present in the previous sentence
C) Use dependency parsing of sentence to understand the meanings
Solution: (A)
Option A describes the Lesk algorithm, which is used for word sense disambiguation; the other approaches cannot be used for this purpose.
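A sketch of the Lesk approach using NLTK's built-in implementation (requires the WordNet corpus; the example word “bank” is illustrative):

```python
from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")
print(sense, "-", sense.definition())
```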
A) Feature Extraction from text
B) Measuring Feature Similarity
C) Engineering Features for vector space learning model
D) All of these
Solution: (D)
NLP can be used anywhere text data is involved – feature extraction, measuring feature similarity, and creating vector features of the text.
A) Dictionary based learning and Word 2 vector model
B) Rule-based learning and Sequence to Sequence model
C) Word 2 vector and Sentence to Vector model
D) Recurrent neural network and convolutional neural network
Solution: (B)
Choice B best describes examples of retrieval-based models and generative models.
A) CRF is Generative whereas HMM is Discriminative model
B) CRF is Discriminative whereas HMM is Generative model
C) Both CRF and HMM are Generative model
D) Both CRF and HMM are Discriminative model
Solution: (B)
Option B is correct: an HMM models the joint probability of observations and labels (generative), whereas a CRF models the conditional probability of labels given observations (discriminative).
I tried my best to make the solutions as comprehensive as possible, but if you have any questions or doubts, please drop them in the comments below. I would also love to hear your feedback about the skill test – feel free to share it in the comments. For the latest and upcoming skill tests, please refer to the DataHack platform of Analytics Vidhya.
And if you are just getting started with Natural Language Processing, check out our most comprehensive programs on NLP.