Text Mining hack: Subject Extraction made easy using Google API

Tavish Srivastava Last Updated: 19 Jul, 2020
6 min read

Let's do a simple exercise. You need to identify the subject and the sentiment in the following sentences:

  1. Google is the best resource for any kind of information.
  2. I came across a fabulous knowledge portal – Analytics Vidhya
  3. Messi played well but Argentina still lost the match
  4. Opera is not the best browser
  5. Yes, like UAE will win the Cricket World Cup.

Was this exercise simple? Even if it looks simple to you, now imagine creating an algorithm to do the same. How does that sound?

The first example is probably the easiest: you know "Google" is the subject, and we also see a positive sentiment about the subject. Automating the two components, namely subject mining and sentiment mining, is a difficult task, given the complex structure of the English language.


Basic sentiment analysis is easy to implement because positive / negative word dictionaries are abundantly available on the internet. However, subject mining dictionaries are very niche, and hence the user needs to create his own dictionary to find the subject. In this article, we will talk about subject extraction and ways to automate it using the Google API.
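As a quick aside, here is what basic dictionary-based sentiment scoring looks like. This is a minimal sketch; the positive / negative word lists are tiny illustrative assumptions, not a real dictionary:

[stextbox id="grey"]
# A minimal dictionary-based sentiment scorer (illustrative word lists only)
POSITIVE = {"best", "fabulous", "well", "good", "coolest"}
NEGATIVE = {"lost", "not", "worst", "bad"}

def sentiment_score(sentence):
    # Count positive minus negative words; > 0 reads positive, < 0 reads negative
    words = sentence.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Google is the best resource for any kind of information."))  # 1 -> positive
print(sentiment_score("Opera is not the best browser"))  # 0 -> negation makes this one hard
[/stextbox]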

Also See: Basics of creating a niche dictionary

 

Why is subject extraction not a common analysis?

The most common text mining projects are those involving sentiment analysis. We rarely hear about subject mining. Why is that?

The answer is simple. Social media, the major source of unstructured data, does a good job with subject mapping. Users themselves make the subject of most comments obvious by hashtagging. These hashtags help us search for comments related to a subject on social media. And since most of our analysis is based on a single subject, subject mining becomes redundant. It would be awesome if we hashtagged all our statements; text mining would become a cakewalk, because we would indirectly be creating structured text on social media. If only the world were ideal!

Hashtags can also help to detect sarcasm to some extent. Consider the following tweet:

Indian batting line was just fine without Sachin. #Sarcasm #CWC2015 #Sachin #Indian-Cricket-Team

Think of this sentence without the hashtags. It would be incomplete and would convey a different meaning. Mining for the hashtag (#Sarcasm) indicates that the sentence is most probably negative. Also, multiple subjects can be extracted from the hashtags and attached to this sentence.

Hopefully, you can now see the importance of these hashtags in the data management and data mining of social networks. They enable social media companies to understand our emotions, preferences, behavior, etc.

 

Why do we even need subject extraction?

While social media does a good job with subject tagging, we still have a number of other sources of unstructured information. For instance, consider the following example:

You run a grocery store chain. You have recently launched a co-branded card which can help you understand the buying patterns of your customers. Additionally, this card can be used at other retail chains. Given that you will now have transaction information of your customers at other retail chains, you will be in a better position to increase the customer's wallet share at your store. For instance, if a customer buys all vegetables at your store but fruits at another, you might consider offering the customer a combo of fruits & vegetables.

In this scenario, you need to mine the name of the retail store and some description of the type of purchase from the transaction description. There are no hashtags or other clues, so you need to do the hard work!

 

What are the challenges in subject extraction?

There are multiple other scenarios where you will need to do subject mining. Why do we call it "hard work"? Here are the major challenges you might face while mining subjects:

  1. Rarely do we find a ready-made dictionary to mine subjects.
  2. Creating a subject-based dictionary is an extremely manual task. You need to pull a representative sample, then pull out the keywords and find a mapping subject for each.
  3. Standardization of subjects is another challenge. For instance, take the following transaction descriptions:
    • “Pizza Hut paid $50”
    • “Pizzahut order for $30”

Now, even if we build a system which can pick the first word or first two words as the subject, we cannot find a common subject for the above two. Hence, we need to build a dictionary which can identify a common subject, i.e. "Pizza Hut", for both these sentences.
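For illustration, such a standardization dictionary can be as simple as the sketch below (the variants and lookup logic are illustrative assumptions, just to make the idea concrete):

[stextbox id="grey"]
# Map raw keyword variants found in transaction descriptions to one standard subject
SUBJECT_MAP = {
    "pizza hut": "Pizza Hut",
    "pizzahut": "Pizza Hut",
}

def extract_subject(description):
    # Return the standardized subject whose variant appears in the description
    text = description.lower()
    for variant, subject in SUBJECT_MAP.items():
        if variant in text:
            return subject
    return "No Match Found"

print(extract_subject("Pizza Hut paid $50"))      # Pizza Hut
print(extract_subject("Pizzahut order for $30"))  # Pizza Hut
[/stextbox]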

 

Possible Framework to build a Subject Extraction Dictionary

There are two critical steps in building a subject mining dictionary:

  1. Find the keywords occurring frequently in the text. This has been covered in detail in this article.
  2. Create a mapping dictionary from these keywords to a standardized subject list.

For the second part, here are the sub-steps you need to follow (a short code sketch after the example below shows how they fit together):

  1. Find the most frequent words in your text / tweets / comments (assume a minimum threshold for a word to appear in the list).
  2. Find the words most associated with these frequently occurring words (you again need to assume a minimum association threshold).
  3. Combine the frequently occurring words with associated words to find searchable pairs.

Now all we need to do is match subjects for each of these pairs. We search pairs and not single words because we need enough context for the search: for example, "Express" might mean "American Express" or "Coffee Express". Two words give enough context, whereas more than two words would make the dictionary too big.

Here is an example of this process:

“Wall Mart has the best offers” 

“Tesco stores are not good with discounts”

“New Wall Mart stores are supposed to open this year”

“Tesco Stores have the coolest loyalty programs and discounts”

Most frequent words (after removing stop-words): 1. Wall  2. Mart  3. Tesco  4. Stores

Most associated word pairs: 1. Wall & Mart  2. Mart & Wall  3. Tesco & Stores  4. Stores & Tesco
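For the curious, here is a minimal sketch of how the three sub-steps above could produce these lists (the stop-word list and both thresholds are illustrative assumptions):

[stextbox id="grey"]
from collections import Counter
from itertools import combinations

sentences = [
    "Wall Mart has the best offers",
    "Tesco stores are not good with discounts",
    "New Wall Mart stores are supposed to open this year",
    "Tesco Stores have the coolest loyalty programs and discounts",
]

# Illustrative stop-word list
STOP_WORDS = {"has", "the", "are", "not", "with", "have", "and", "this", "to", "new", "best", "good"}
tokenized = [[w for w in s.lower().split() if w not in STOP_WORDS] for s in sentences]

# Step 1: frequent words above a minimum threshold (here: at least 2 occurrences)
word_counts = Counter(w for words in tokenized for w in words)
frequent = {w for w, c in word_counts.items() if c >= 2}

# Step 2: words associated through co-occurrence in the same sentence
pair_counts = Counter()
for words in tokenized:
    for a, b in combinations(sorted(set(words)), 2):
        pair_counts[(a, b)] += 1

# Step 3: keep searchable pairs that involve a frequent word and meet the association threshold
searchable_pairs = sorted(p for p, c in pair_counts.items() if c >= 2 and (p[0] in frequent or p[1] in frequent))
print(searchable_pairs)
# [('discounts', 'stores'), ('discounts', 'tesco'), ('mart', 'wall'), ('stores', 'tesco')]
[/stextbox]

With these toy settings "discounts" also survives the frequency cut, which shows why the stop-word list and thresholds need some tuning on real data.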

Now we’ll use these words to search for the right subject.

 

How to automate the process of Subject Extraction Dictionary Creation?

The second step of subject mining is creating keyword-to-subject pairs. This step is generally done manually, but let's take a shot at automating it. Here is what we intend to do:

  1. Pick up the keyword pairs found significant in the context (coming from the last step).
  2. Run a Google search on this pair.
  3. Pick the first 4 links which Google returns.
  4. If two of the first 4 links come from the same site, we return that site (its domain). If the search is not unanimous, we return "No Match Found".

Let's first create a function which can retrieve the first four links returned by a Google search and then check whether they share a common link. Here is the code to do the same:

(Code screenshot: the searchengine() function. The complete code is given at the end of this article.)

Now, let's create a list of keywords which our code can search. (Notice that each of these keywords is quite different, but Google will help us standardize them.)

(Code screenshot: the list of keywords to search. The complete code is given at the end of this article.)

It's now time to test our function:

(Code screenshot: the test run and its output.)
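Since the code screenshots are not reproduced here, a hypothetical test would look roughly like the sketch below. It uses the searchengine() function from the complete listing at the end of this article, keyword strings that are only stand-ins for the original ones, and it assumes the Google AJAX Search endpoint used in the listing still responds:

[stextbox id="grey"]
# Hypothetical test of the searchengine() function defined in the full listing below.
# The exact keywords from the original screenshots are not available; these are illustrative stand-ins.
keywords = ["Pizza hut", "dominos pizza", "analytics vidhya"]
for keyword in keywords:
    print(keyword + " -> " + str(searchengine(keyword)))
# Differently spelled variants of the same brand (e.g. "Pizza hut" vs "Pizzahut")
# should resolve to the same domain, which then acts as the standardized subject.
[/stextbox]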

And bingo! Our code was given quite different inputs, but it has done fairly well at spotting the right set of subjects. Also notice that this dictionary is not limited to any particular scope of subjects: two of the searches are fast food chains and the third is an analytics website. Hence, we are creating a more generalized dictionary in this case. Now all we need to do is build rules using these keywords and map them to the matched links.

Here is the entire code:

[stextbox id="grey"]
# Python 2 code; it uses urllib.quote / urllib.urlopen, the urlparse module
# and the Google AJAX Search API that was available at the time of writing.
import urllib
import json
import numpy as np
from urlparse import urlparse

def searchengine(examplesearch):
    # Query the Google AJAX Search API and collect the domains of the first four results
    encoded = urllib.quote(examplesearch)
    rawData = urllib.urlopen('http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=' + encoded).read()
    jsonData = json.loads(rawData)
    searchResults = jsonData['responseData']['results']
    links = np.empty([4, 1], dtype="S25")
    i = 0
    for er in searchResults:
        link = er['url']
        links[i, 0] = urlparse(link).netloc
        i = i + 1
    # If any two of the four domains agree, return that domain; otherwise report no match
    target = "No Match found"
    if links[0, 0] == links[1, 0] or links[0, 0] == links[2, 0] or links[0, 0] == links[3, 0]:
        target = links[0, 0]
    if links[1, 0] == links[2, 0] or links[1, 0] == links[3, 0]:
        target = links[1, 0]
    if links[2, 0] == links[3, 0]:
        target = links[2, 0]
    return [target]

import pandas as pd
import os

os.chdir(r"C:\Users\Tavish\Desktop")

# Read the transaction descriptions and look up a matching subject for each one
Transaction_details = pd.read_csv("Descriptions.csv")
Transaction_details["match"] = "blank"
for i in range(0, 11):
    descr = Transaction_details['Descriptions'][i]
    Transaction_details["match"][i] = searchengine(descr)
Transaction_details
[/stextbox]

 

End Notes

The approach mentioned in this article can be used to create a generalized dictionary which is not restricted to any subject. Essentially, we use the power of Google to auto-correct the input keywords and return the most appropriate results. If the result is unanimous, it tells us that Google has found a decent match across the web. This approach minimizes the human effort involved in creating such tedious subject extraction dictionaries.

Thinkpot: Can you think of more cases where Google APIs can be used? Share useful links to related videos or articles that leverage Google APIs.

Did you find the article useful? Do let us know your thoughts about this article in the box below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, in domains including Digital Acquisitions, Customer Servicing and Customer Management, and in industries including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

