NLP, or Natural Language Processing, is the science of processing, understanding, and generating human language by machines. With NLP, information can be extracted from unstructured data, models can be trained to generate responses to human queries, and text can be classified into appropriate categories. News articles, social media posts, and online reviews are some publicly available sources that are rich in information. NLP is used to derive meaningful insights from these sources, but training NLP algorithms directly on free-form text can introduce a lot of noise and add unnecessary complexity. To derive meaningful insights from such unstructured data, it first needs to be cleansed and brought to a level appropriate for analysis.
This article covers some of the widely used preprocessing steps and shows how to understand the structure and vocabulary of the text, along with the corresponding code in Python. The exact list of steps depends on the quality of the text, the objective of the study, and the NLP task to be performed.
Document 1 – Natural Language Processing (NLP) is a field within Artificial Intelligence (AI) that is concerned with how computers deal with human language. We are already interacting with such machines in our day-to-day life in the form of IVRs & chat-bots. But do Machines really understand human language, context, syntax, semantics, etc.? Yes, they can be trained to do so!
Document 2 – Google has trained its search engine to make autofill recommendations as text is typed using NLP. Google’s search engine has the capability of
understanding the meaning of words depending on the context in the search. Google’s “state-of-the-art” search engine is one of the most sophisticated examples of NLP.
Document 3 – Origination of Natural Language Processing dates back to the II world war when there was a need for machine translation between Russian & English. Today, NLP has expanded beyond these two languages and can deal with most languages, including sign language.
Document 4 – NLP is actively used for a variety of day-to-day activities like spam detection, recruitment, smart assistants, understanding customer behaviour & so on…… Usage and impact of NLP are growing exponentially, across a wide range of industries.
Document 5 – Acronym NLP is used for both Natural Language Processing & Neuro-Linguistic Programming, but, these are completely different fields of science. Neuro-Linguistic Programming deals with human-to-human interaction, whereas Natural Language Processing deals with human-to-computer interaction.
Loading the above documents into a pandas DataFrame in Python:
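The loading step itself is not shown in the extract; a minimal sketch (assuming the five documents are held in plain Python strings, abridged here, and that the column names match those used in the tokenization code later: ‘doc_id’ and ‘text’) could look like:

```python
import pandas as pd

# the five example documents, abridged for brevity
docs = [
    "Natural Language Processing (NLP) is a field within Artificial Intelligence (AI)...",
    "Google has trained it's search engine to make autofill recommendations...",
    "Origination of Natural Language Processing dates back to the II world war...",
    "NLP is actively used for a variety of day-to-day activities like spam detection...",
    "Acronym NLP is used for both Natural Language Processing & Neuro-Linguistic Programming...",
]

# one row per document, with a numeric document ID for reference
text_doc = pd.DataFrame({'doc_id': range(1, len(docs) + 1), 'text': docs})
print(text_doc)
```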
   Document ID                                               Text
0            1  Natural Language Processing (NLP) is a field w...
1            2  Google has trained it's search engine to make ...
2            3          Origination of Natural Language Proces...
3            4  NLP is actively used for a variety of day-to-d...
4            5     Acronym NLP is used for both Natural Languag...
Breaking paragraphs into sentences makes them more manageable, and the context within each sentence can be better understood.
In the first document, the first couple of sentences are informative, followed by a question in a rising tone. The tone of an individual sentence gets masked when looking at the complete extract, and cannot be deciphered once the text is further broken into tokens.
from nltk.tokenize import sent_tokenize

# breaking every document into sentences
doc_w_sent = [sent_tokenize(text) for text in text_doc.text]

# creating document ID & sentence ID for reference
doc_num_list = [[x] * y for x, y in zip(text_doc.doc_id, [len(doc) for doc in doc_w_sent])]
sentence_num_list = [list(range(1, len(doc) + 1)) for doc in doc_w_sent]

# un-nesting lists
doc_w_sent = [x for element in doc_w_sent for x in element]
doc_num_list = [x for element in doc_num_list for x in element]
sentence_num_list = [x for element in sentence_num_list for x in element]

# creating dataframe
text_data = pd.DataFrame({'Document ID': doc_num_list,
                          'Sentence ID': sentence_num_list,
                          'Text': doc_w_sent})
print(text_data)
NLTK is a Python package for NLP tasks. As the name suggests, ‘sent_tokenize’ breaks paragraphs into sentences based on end-of-sentence punctuation marks: the period, question mark, and exclamation mark.
Output:

    Document ID  Sentence ID                                               Text
0             1            1  Natural Language Processing (NLP) is a field w...
1             1            2  We are already interacting with such machines ...
2             1            3  But do Machines really understand human langua...
3             1            4                 Yes, they can be trained to do so!
4             2            1  Google has trained it's search engine to make ...
5             2            2   Google's search engine has the capability of ...
6             2            3  Google's "state-of-the-art" search engine is o...
7             3            1          Origination of Natural Language Proces...
8             3            2  Today, NLP has expanded beyond these two langu...
9             4            1  NLP is actively used for a variety of day-to-d...
10            4            2  Usage and impact of NLP is growing exponential...
11            5            1     Acronym NLP is used for both Natural Languag...
12            5            2  Neuro Linguistic Programming deals with human-...
A list is created with each sentence as an individual element. From the “Document ID” and “Sentence ID” columns, it can be inferred that documents 1 through 5 contain 4, 3, 2, 2, and 2 sentences respectively.
Part-of-speech tagging is the process of assigning a label (part of speech) to each word in a sentence based on the context in which it is used and the meaning of the word. It is critical for Named Entity Recognition (NER), understanding relationships between words, developing linguistic rules, and lemmatization.
import nltk

pos_tag = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in text_data.Text]
print(pos_tag)
NLTK provides part-of-speech tagging, which is run on every word in each sentence.
[[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('within', 'IN'), ('Artificial', 'JJ'), ('Intelligence', 'NNP'), ('(', '('), ('AI', 'NNP'), (')', ')'), ('that', 'WDT'), ('is', 'VBZ'), ('concerned', 'VBN'), ('with', 'IN'), ('how', 'WRB'), ('computers', 'NNS'), ('deal', 'VBP'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')], [('We', 'PRP'), ('are', 'VBP'), ('already', 'RB'), ('interacting', 'VBG'), ('with', 'IN'), ('such', 'JJ'), ('machines', 'NNS'), ('in', 'IN'), ('our', 'PRP$'), ('day-to-day', 'JJ'), ('life', 'NN'), ('in', 'IN'), ('the', 'DT'), ('form', 'NN'), ('of', 'IN'), ('IVRs', 'NNP'), ('&', 'CC'), ('chat-bots', 'NNS'), ('.', '.')], [('But', 'CC'), ('do', 'VBP'), ('Machines', 'NNS'), ('really', 'RB'), ('understand', 'VBP'), ('human', 'JJ'), ('language', 'NN'), (',', ','), ('context', 'NN'), (',', ','), ('syntax', 'NN'), (',', ','), ('semantics', 'NNS'), (',', ','), ('etc', 'FW'), ('.', '.'), ('?', '.')], [('Yes', 'UH'), (',', ','), ('they', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('trained', 'VBN'), ('to', 'TO'), ('do', 'VB'), ('so', 'RB'), ('!', '.')], [('Google', 'NNP'), ('has', 'VBZ'), ('trained', 'VBN'), ('it', 'PRP'), ("'s", 'VBZ'), ('search', 'JJ'), ('engine', 'NN'), ('to', 'TO'), ('make', 'VB'), ('autofill', 'JJ'), ('recommendations', 'NNS'), ('as', 'IN'), ('text', 'NN'), ('is', 'VBZ'), ('typed', 'VBN'), ('using', 'VBG'), ('NLP', 'NNP'), ('.', '.')], [('Google', 'NNP'), ("'s", 'POS'), ('search', 'NN'), ('engine', 'NN'), ('has', 'VBZ'), ('the', 'DT'), ('capability', 'NN'), ('of', 'IN'), ('understanding', 'VBG'), ('meaning', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('depending', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('context', 'NN'), ('in', 'IN'), ('the', 'DT'), ('search', 'NN'), ('.', '.')], [('Google', 'NNP'), ("'s", 'POS'), ('``', '``'), ('state-of-the-art', 'JJ'), ("''", "''"), ('search', 'NN'), ('engine', 'NN'), ('is', 'VBZ'), ('one', 
'CD'), ('of', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('sophisticated', 'JJ'), ('examples', 'NNS'), ('of', 'IN'), ('NLP', 'NNP'), ('.', '.')], [('Origination', 'NN'), ('of', 'IN'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('dates', 'VBZ'), ('back', 'RB'), ('to', 'TO'), ('II', 'NNP'), ('world', 'NN'), ('war', 'NN'), (',', ','), ('when', 'WRB'), ('there', 'EX'), ('was', 'VBD'), ('a', 'DT'), ('need', 'NN'), ('for', 'IN'), ('machine', 'NN'), ('translation', 'NN'), ('between', 'IN'), ('Russian', 'NNP'), ('&', 'CC'), ('English', 'NNP'), ('.', '.')], [('Today', 'NN'), (',', ','), ('NLP', 'NNP'), ('has', 'VBZ'), ('expanded', 'VBN'), ('beyond', 'IN'), ('these', 'DT'), ('two', 'CD'), ('languages', 'NNS'), ('and', 'CC'), ('can', 'MD'), ('deal', 'VB'), ('with', 'IN'), ('most', 'JJS'), ('languages', 'NNS'), (',', ','), ('including', 'VBG'), ('sign', 'JJ'), ('language', 'NN'), ('.', '.')], [('NLP', 'NNP'), ('is', 'VBZ'), ('actively', 'RB'), ('used', 'VBN'), ('for', 'IN'), ('a', 'DT'), ('variety', 'NN'), ('of', 'IN'), ('day-to-day', 'JJ'), ('activities', 'NNS'), ('like', 'IN'), ('spam', 'NN'), ('detection', 'NN'), (',', ','), ('recruitment', 'NN'), (',', ','), ('smart', 'JJ'), ('assistants', 'NNS'), (',', ','), ('understanding', 'VBG'), ('customer', 'NN'), ('behavior', 'NN'), ('&', 'CC'), ('so', 'RB'), ('on', 'IN'), ('.', '.')], [('Usage', 'NN'), ('and', 'CC'), ('impact', 'NN'), ('of', 'IN'), ('NLP', 'NNP'), ('is', 'VBZ'), ('growing', 'VBG'), ('exponentially', 'RB'), (',', ','), ('across', 'IN'), ('a', 'DT'), ('wide', 'JJ'), ('range', 'NN'), ('of', 'IN'), ('industries', 'NNS'), ('.', '.')], [('Acronym', 'NNP'), ('NLP', 'NNP'), ('is', 'VBZ'), ('used', 'VBN'), ('for', 'IN'), ('both', 'DT'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('&', 'CC'), ('Neuro', 'NNP'), ('Linguistic', 'NNP'), ('Programming', 'NNP'), (',', ','), ('but', 'CC'), (',', ','), ('these', 'DT'), ('are', 'VBP'), ('completely', 'RB'), ('different', 'JJ'), ('fields', 
'NNS'), ('of', 'IN'), ('science', 'NN'), ('.', '.')], [('Neuro', 'NNP'), ('Linguistic', 'NNP'), ('Programming', 'NNP'), ('deals', 'NNS'), ('with', 'IN'), ('human-to-human', 'JJ'), ('interaction', 'NN'), (',', ','), ('where', 'WRB'), ('as', 'IN'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('deals', 'NNS'), ('with', 'IN'), ('human-to-computer', 'JJ'), ('interaction', 'NN'), ('.', '.')]]
The output is a list of tuples, one per token in each sentence. In the first sentence of the first document, the word “Processing” is tagged as a proper noun (NNP). But the word “processing” can also be used as a verb, as in “I am processing the data.” For machines to fully understand the meaning of text, identifying the correct part of speech is important.
Special characters usually do not add value to the text. One way they can be utilized is to parse the text on the occurrence of a particular special character, or to indicate the need for word expansions. Once they have served their purpose, they can be removed so that no redundant information is processed by the NLP algorithms.
from string import punctuation
import re

text_data.Text = [re.sub('[' + punctuation + ']', ' ', sent) for sent in text_data.Text]
for sent in text_data.Text:
    print(sent)
The ‘string’ package’s ‘punctuation’ constant contains the following characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. This list can be modified by adding or removing characters.
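As a self-contained illustration on a single sentence (using re.escape, which avoids relying on character-class quirks of the raw punctuation string):

```python
import re
from string import punctuation

sample = 'Google\'s "state-of-the-art" search engine is one of the most sophisticated examples of NLP.'
# replace every punctuation character with a single space
cleaned = re.sub('[' + re.escape(punctuation) + ']', ' ', sample)
print(cleaned)
```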
Output:
Natural Language Processing NLP is a field within Artificial Intelligence AI that is concerned with how computers deal with human language
We are already interacting with such machines in our day to day life in the form of IVRs chat bots
But do Machines really understand human language context syntax semantics etc
Yes they can be trained to do so
Google has trained it s search engine to make autofill recommendations as text is typed using NLP
Google s search engine has the capability of understanding meaning of words depending on the context in the search
Google s state of the art search engine is one of the most sophisticated examples of NLP
Origination of Natural Language Processing dates back to II world war when there was a need for machine translation between Russian English
Today NLP has expanded beyond these two languages and can deal with most languages including sign language
NLP is actively used for a variety of day to day activities like spam detection recruitment smart assistants understanding customer behavior so on
Usage and impact of NLP is growing exponentially across a wide range of industries
Acronym NLP is used for both Natural Language Processing Neuro Linguistic Programming but these are completely different fields of science
Neuro Linguistic Programming deals with human to human interaction where as Natural Language Processing deals with human to computer interaction
Punctuation marks are replaced by a single space.
Like special characters, certain words do not add any value to the text. These are called stop words, and they can belong to any part of speech. There is usually a general list of stop words that works for any NLP task, but it can be modified depending on the text.
For instance, finding “with” as the most frequent token, or even a noun like “language” as the most common token in the above paragraphs, is not useful information. Such redundant words should be removed to generate truly valuable insights.
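To see why stop words should be matched as whole words, here is a small sketch with a hand-picked stop list (the full list used in the code below comes from NLTK); the \b word boundaries prevent, for example, “is” being removed from inside “this”:

```python
import re

# a tiny, hand-picked stop list for illustration only
stop_words = {'is', 'a', 'that', 'with', 'how'}
# \b ensures only whole words are matched
pattern = r'\b(?:' + '|'.join(stop_words) + r')\b'

sent = "NLP is a field that is concerned with how computers deal with human language"
print(re.sub(pattern, ' ', sent))
```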
from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('english'))
# match stop words as whole words only, so e.g. 'is' inside 'this' is untouched
stopwords_all = r'\b(?:' + '|'.join(re.escape(s) for s in stop_words) + r')\b'
text_data.Text = [re.sub(stopwords_all, ' ', sent) for sent in text_data.Text]
for sent in text_data.Text:
    print(sent)
NLTK contains a predefined set of stop words for various languages. This set can be altered to add/remove any word, depending on the context and quality of the text.
Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language
We already interacting machines day day life form IVRs chat bots
But Machines really understand human language context syntax semantics etc
Yes trained
Google trained search engine make autofill recommendations text typed using NLP
Google search engine capability understanding meaning words depending context search
Google state art search engine one sophisticated examples NLP
Origination Natural Language Processing dates back II world war need machine translation Russian English
Today NLP expanded beyond two languages deal languages including sign language
NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior
Usage impact NLP growing exponentially across wide range industries
Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science
Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction
Helping verbs such as “is”/“are”, determiners such as “that”/“these”, pronouns such as “they”, and prepositions such as “with” have been removed. Comparing the first sentence of the first document, “is a”, “that is”, “with”, and “how” have been dropped.
White-space characters are very commonly found in any text, and they can also be introduced as part of other cleansing steps. Extra white space is unnecessary and irrelevant; the code below removes it.
import re

text_data.Text = [re.sub(r'\s+', ' ', sent).strip() for sent in text_data.Text]
for sent in text_data.Text:
    print(sent)
‘\s’ matches any white-space character: space, tab (‘\t’), newline (‘\n’), carriage return (‘\r’), and form feed (‘\f’), so a single ‘\s+’ pattern covers them all. The ‘strip()’ function removes any leading and trailing spaces.
Output:
Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language
We already interacting machines day day life form IVRs chat bots
But Machines really understand human language context syntax semantics etc
Yes trained
Google trained search engine make autofill recommendations text typed using NLP
Google search engine capability understanding meaning words depending context search
Google state art search engine one sophisticated examples NLP
Origination Natural Language Processing dates back II world war need machine translation Russian English
Today NLP expanded beyond two languages deal languages including sign language
NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior
Usage impact NLP growing exponentially across wide range industries
Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science
Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction
Multiple white space characters are replaced by a single space and any leading and trailing blanks are removed.
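A quick self-contained check of the pattern:

```python
import re

# a string with repeated spaces, a tab, a newline, and leading/trailing blanks
messy = "  Natural   Language\tProcessing \n NLP  "
print(re.sub(r'\s+', ' ', messy).strip())  # Natural Language Processing NLP
```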
A Document-Term Matrix provides structure to unstructured data. It is a basic way of creating fixed-length input for machine-learning algorithms: every document is represented as a row and every token as a column. The two most common value sets are count and TF-IDF. As the name suggests, the count is the total number of occurrences of a token in a document. TF-IDF (Term Frequency – Inverse Document Frequency) is the term frequency (the count of a token in a document divided by the count of all tokens in that document) weighted by the inverse document frequency (the logarithm of the total number of documents divided by the number of documents containing the token). It assesses how common or rare a word is across the corpus, and is used to find tokens that can distinguish and classify documents.
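The classic formulation can be sketched in a few lines of plain Python (note that scikit-learn's TfidfVectorizer, used further below, applies a smoothed IDF and L2 normalization, so its numbers differ from this textbook version):

```python
import math

# toy corpus of pre-tokenized documents
docs = [["nlp", "is", "fun"], ["nlp", "nlp", "rocks"]]
N = len(docs)

def tf(token, doc):
    # proportion of the document's tokens that are `token`
    return doc.count(token) / len(doc)

def idf(token):
    # log of (total documents / documents containing the token)
    df = sum(token in doc for doc in docs)
    return math.log(N / df)

# "nlp" appears in every document, so its IDF (and TF-IDF) is zero
print(tf("nlp", docs[1]) * idf("nlp"))              # 0.0
# "rocks" appears in only one document, so it gets a positive weight
print(tf("rocks", docs[1]) * idf("rocks"))
```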
# roll up cleansed sentences at document level
text_doc_cleansed = text_data.groupby('Document ID')['Text'].apply(list)
text_doc_cleansed = [' '.join(doc) for doc in text_doc_cleansed]
print(text_doc_cleansed)
Rolled up sentences at the document level.
['Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language We already interacting machines day day life form IVRs chat bots But Machines really understand human language context syntax semantics etc Yes trained', 'Google trained search engine make autofill recommendations text typed using NLP Google search engine capability understanding meaning words depending context search Google state art search engine one sophisticated examples NLP', 'Origination Natural Language Processing dates back II world war need machine translation Russian English Today NLP expanded beyond two languages deal languages including sign language', 'NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior Usage impact NLP growing exponentially across wide range industries', 'Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction']
Count
from sklearn.feature_extraction.text import CountVectorizer

countvectorizer = CountVectorizer()
countvectors = countvectorizer.fit_transform(text_doc_cleansed)
countfeature_names = countvectorizer.get_feature_names_out()
count_df = pd.DataFrame(countvectors.toarray(), columns=countfeature_names)
print(count_df)
Output:
   acronym  across  actively  activities  ai  already  art  artificial  assistants  autofill  back  ...  usage  used  using  variety  war  we  wide  within  words  world  yes
0        0       0         0           0   1        1    0           1           0         0     0  ...      0     0      0        0    0   1     0       1      0      0    1
1        0       0         0           0   0        0    1           0           0         1     0  ...      0     0      1        0    0   0     0       0      1      0    0
2        0       0         0           0   0        0    0           0           0         0     1  ...      0     0      0        0    1   0     0       0      0      1    0
3        0       1         1           1   0        0    0           0           1         0     0  ...      1     1      0        1    0   0     1       0      0      0    0
4        1       0         0           0   0        0    0           0           0         0     0  ...      0     1      0        0    0   0     0       0      0      0    0
The output is a DataFrame with 5 rows, one per document, and 113 columns, one per token.
TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer()
tfidfvectors = tfidfvectorizer.fit_transform(text_doc_cleansed)
tfidffeature_names = tfidfvectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidfvectors.toarray(), columns=tfidffeature_names)
print(tfidf_df)
Output:
    acronym    across  actively  activities        ai   already       art  artificial  assistants  ...     using   variety       war        we      wide    within     words     world       yes
0  0.000000  0.000000  0.000000    0.000000  0.161541  0.161541  0.000000    0.161541    0.000000  ...  0.000000  0.000000  0.000000  0.161541  0.000000  0.161541  0.000000  0.000000  0.161541
1  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.138861    0.000000    0.000000  ...  0.138861  0.000000  0.000000  0.000000  0.000000  0.000000  0.138861  0.000000  0.000000
2  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000    0.000000    0.000000  ...  0.000000  0.000000  0.201746  0.000000  0.000000  0.000000  0.000000  0.201746  0.000000
3  0.000000  0.204921  0.204921    0.204921  0.000000  0.000000  0.000000    0.000000    0.204921  ...  0.000000  0.204921  0.000000  0.000000  0.204921  0.000000  0.000000  0.000000  0.000000
4  0.161969  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000    0.000000    0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
The dimensions of the TF-IDF data frame are the same as those of the count data frame. Comparing the two vectorizations: with counts, tokens that appear the same number of times, whether within a document or across documents, get the same value. With TF-IDF, tokens with the same number of occurrences can receive different values, depending on the length of the document and, as mentioned earlier, on the number of documents they appear in.
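The effect can be seen on a hypothetical two-document corpus (again, scikit-learn's TF-IDF uses a smoothed IDF and L2 normalization):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["nlp is fun", "nlp nlp rocks"]  # toy corpus
count = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# vocabulary is sorted alphabetically: fun, is, nlp, rocks
print(count)
print(tfidf.round(3))
# "nlp" occurs in both documents, so in document 0 TF-IDF gives it a
# lower weight than "fun", even though both have count 1 there
```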
A word cloud is a text-visualization technique in which tokens are printed with a size proportional to their frequency or importance in the text.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(background_color="white", width=3000, height=2000,
                      max_words=500).generate_from_frequencies(tfidf_df.T.sum(axis=1))
plt.imshow(wordcloud)
The above steps are not an exhaustive recipe for cleansing text and gaining a full understanding of its structure, syntax, and semantics. There are further tasks such as changing case, expanding contractions, harmonizing text (where the same entity is represented in more than one way), spell checking, dependency parsing, and so on. The steps to be performed depend on the NLP objective: for text summarization or classification, the text would be studied in its entirety, whereas for Named Entity Recognition or part-of-speech tagging, paragraphs might be broken into sentences or tokens. The quality of the text and the objective of the study play a huge role in determining the level of preprocessing.