Data mining is the process of extracting relevant information from a large corpus of natural language. Data mining sorts through large data sets to find patterns and relationships that can be used in data analysis to help solve business challenges. Thanks to data mining techniques and technologies, enterprises can forecast future trends and make more informed business decisions.
Syntax and semantics are the two major components of any written language. Syntax is the set of rules/grammar a writer should follow when writing in a language, and semantics is the underlying meaning of the sentences or phrases the author has written.
As a result, two types of data mining techniques are widely used in the data science industry today.
Syntax-based data mining: Here, we extract data based on the characters or words in a sentence, without considering the implicit meaning. E.g., the TF-IDF method.
Pros:
Easy to implement
Easy to interpret
Cons:
Not always accurate
Does not take the actual context of the data mining scenario into account
Semantic-based data mining: Here, we extract data based on the internal meaning of the corpus and its sentences; it works on the implicit meaning of the sentence.
Pros:
Takes the actual context of the data mining scenario into account
Can build more trust and robustness with the end user
Cons:
Hard to implement
More computation is needed for text matching
As we have seen above, semantic-based data mining applications not only have many advantages but can also build a strong trust relationship with the end user.
Many organizations and tech companies working on data mining are now moving from syntax-based to semantic-based data mining processes. So in this blog, I describe some interview questions on semantic-based data mining that you may be asked in a data mining engineer interview.
Semantics-based Data Mining Interview Questions
1 – What is semantic search? How does it differ from syntactic search?
Ans: Semantic search is a method of data searching in which a search query not only looks for keywords but also seeks to ascertain the intent and context of the words being used. Search engines use semantic search to decipher the context and intent of your query and provide results that are relevant to your search.
Syntactic matching, by contrast, matches search terms to keywords based purely on the terms entered into the search engine; this covers phrase matching and exact matching. Semantic matching matches search queries to keywords based on the intent behind the searcher's input; broad match applies here.
2- What is the role of embeddings in semantic search?
Ans: In semantic search, the search operations are performed on embeddings. There are different types of semantic search, e.g., textual semantic search for natural language systems and visual semantic search for image processing systems. Whatever the case, we convert the query and the corpus objects into numerical fixed-length vectors, and the matching operations are applied to these vectors. These vectors are called embeddings.
A sample flow diagram is shown below.
Converting textual sentences into embeddings enables the system to perform mathematical operations faster while retaining the inner meaning. Embeddings also help systems encode similar sentences into similar vectors of fixed/variable length.
So any semantic-search-based system will contain at least the following two modules.
1 – Embedding module
Where the objects are encoded into embeddings
2 – Searching module
Where the actual search/match operations on the embeddings take place (a minimal sketch of this flow is given below)
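Below is a minimal sketch of these two modules. The `embed()` function here is a hypothetical placeholder (a real system would use one of the embedding models discussed in Question 4), and the searching module simply ranks corpus vectors by cosine similarity against the query vector.

```python
# Minimal sketch of a semantic search system, assuming a hypothetical embed() function.
import numpy as np

def embed(sentence: str) -> np.ndarray:
    # Placeholder embedding module: a deterministic pseudo-embedding for illustration only.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(128)

def search(query: str, corpus: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    # Searching module: cosine similarity between the query embedding and corpus embeddings.
    corpus_vecs = np.stack([embed(s) for s in corpus])
    q = embed(query)
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]
    return [(corpus[i], float(sims[i])) for i in order]

print(search("cheap flights to Paris", ["book a flight", "buy groceries", "hotel in Paris"]))
```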
3 – What are the different types of semantic similarity methods?
Ans:
There are different semantic similarity methods, and all of them fall under four major types:
Knowledge-Based Similarity
This kind is used to compare concepts semantically. Each concept in this category is represented as a node in an ontology graph. Because the graph represents the concepts from the corpus, this approach is also known as the topological approach. The fewer edges connecting two concepts (nodes), the more semantically and conceptually similar they are.
E.g., in Figure 1 below, the similarity structure shows that coin is more related to money than to credit.
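As a small illustration of the knowledge-based (topological) idea, the sketch below uses NLTK's WordNet graph, where path_similarity is higher when fewer edges separate two concepts. It assumes nltk is installed and the WordNet corpus has been downloaded; exact scores depend on the WordNet version.

```python
# Knowledge-based similarity sketch over the WordNet ontology graph.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

coin = wn.synset("coin.n.01")
money = wn.synset("money.n.01")
credit = wn.synset("credit.n.01")

print("coin vs money :", coin.path_similarity(money))    # fewer edges apart -> higher score
print("coin vs credit:", coin.path_similarity(credit))
```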
Statistical-Based Similarity
This kind uses feature vectors learned from the corpus to calculate semantic similarity.
Word embeddings can be used in conjunction with most of the techniques in this category to improve results because they capture the semantic relationship between words.
String-Based Similarity
This type calculates the distance between strings or non-zero feature vectors. Evaluating semantic similarity does not rely on this type alone; rather, it is mixed with other types.
It is again classified into two major types (a short sketch of two common term-based measures follows these lists):
Character-Based Similarity Measure
Longest Common Substring (LCS)
Damerau-Levenshtein
Jaro
Jaro-Winkler
Needleman-Wunsch
Etc…
Term-Based Similarity Measure
Block Distance
Cosine Similarity
Soft Cosine Similarity
Sorensen-Dice index
Euclidean Distance
Jaccard Index
Etc…
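Here is a short sketch of two of the term-based measures above, cosine similarity and the Jaccard Index, computed by hand with numpy and Python sets over toy sentences.

```python
# Term-based similarity sketch: cosine similarity over term-frequency vectors and the Jaccard Index over token sets.
import numpy as np

def cosine_similarity(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard_index(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

s1 = "the coin is a form of money".split()
s2 = "money can be a coin or a note".split()

vocab = sorted(set(s1) | set(s2))
v1 = [s1.count(w) for w in vocab]   # term-frequency vectors over a shared vocabulary
v2 = [s2.count(w) for w in vocab]

print("cosine :", cosine_similarity(v1, v2))
print("jaccard:", jaccard_index(set(s1), set(s2)))
```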
Language Model-Based Similarity
The scientific community first used this type in 2016 to quantify the semantic similarity of two English sentences under the premise that they are syntactically sound.
This type has five main steps (a short sketch of the first two is given after the list):
Removing stop words
Tagging the two phrases using any Part of Speech (POS) algorithm
From the tagging step's output, forming a structure tree (parse tree) for each phrase
Building an undirected weighted graph using the parsing tree
Finally, the similarity is calculated as the minimum distance path between nodes (words)
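The sketch below covers only the first two steps (stop-word removal and POS tagging) using NLTK; the parse tree and graph steps depend on the specific parser chosen. Note that NLTK resource names can differ slightly between library versions.

```python
# Steps 1-2 of the language-model-based pipeline: stop-word removal, then POS tagging.
import nltk
for resource in ("punkt", "stopwords", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)   # resource names may vary slightly across NLTK versions
from nltk.corpus import stopwords

sentence = "The coin is a small piece of metal used as money"
tokens = nltk.word_tokenize(sentence)
content_words = [t for t in tokens if t.lower() not in stopwords.words("english")]
print(nltk.pos_tag(content_words))        # (word, POS tag) pairs for the content words
```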
4 – How can we transform natural language sentences into vectors/embeddings?
Ans: There are different methods to convert sentences to vectors/embeddings. Some of those are
One Hot Encoding
In this method, each word in a vocabulary of size V is given an integer index i, ranging from 0 to V-1. The vector representation of each word is of length V, with all 0s except a 1 at the i-th index for the corresponding word. It is a word-level representation.
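A minimal one-hot encoding sketch over a toy vocabulary is shown below: each word maps to a length-V vector with a single 1.

```python
# One-hot encoding sketch: length-V vector with a 1 at the word's index.
vocab = ["coin", "money", "credit", "bank"]
index = {word: i for i, word in enumerate(vocab)}   # word -> integer index 0..V-1

def one_hot(word: str) -> list[int]:
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("money"))   # [0, 1, 0, 0]
```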
BOW & BON
Bag Of Words(BOW)
As its name implies, Bag of Words is simply a collection of words, and its vector representation ignores two important factors: 1) the order of the tokens in the sentence or sequence, and 2) the tokens' semantic significance. Bag of Words is based only on the frequency of tokens found in a sentence. It is a sentence-level representation.
Bag Of N-grams(BON)
BoN can be considered an improved form of BoW in which, rather than measuring the frequency of individual tokens, we generate groups of N tokens and measure their frequencies instead. Each such group is called an "N-gram". It is a sentence-level representation (a combined BoW/BoN sketch is given below).
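The sketch below covers both BoW and BoN using scikit-learn's CountVectorizer: ngram_range=(1, 1) gives plain word counts, while ngram_range=(2, 2) counts bigrams instead.

```python
# Combined BoW / BoN sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the coin is money", "money is not credit"]

bow = CountVectorizer(ngram_range=(1, 1))            # Bag of Words
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

bon = CountVectorizer(ngram_range=(2, 2))            # Bag of 2-grams
print(bon.fit_transform(corpus).toarray())
print(bon.get_feature_names_out())
```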
TF-IDF
TF-IDF assigns a single float value per word and solves a word-significance problem that can be highly useful in text classification: none of the three methods above has any notion of word importance, i.e., how significant a single word is within a document or sentence. It is a sentence-level representation.
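A short TF-IDF sketch using scikit-learn's TfidfVectorizer is shown below; words that are frequent in a document but rare across the corpus receive higher weights.

```python
# TF-IDF sketch: one row per document, one weighted column per term.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the coin is money", "money is not credit", "the credit card"]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))
```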
Word2Vec (CBoW & Skip Gram)
Word2Vec is a neural network-based model for learning word embeddings and is one of the most well-known names in the word embedding arena. It resolves the two most pressing issues we had before: it produces lower-dimensional representations and interprets words based on their context (vicinity words). It is a word-level representation.
Continuous Bag Of Words (CBoW)
CBoW aims to train a Neural Network that predicts the target word using input from context words.
Skip Gram Model
Skip Gram operates in the opposite direction to CBoW: it aims to predict the context words from a single input word (a sketch of both variants follows).
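Below is a gensim-based sketch of both variants: sg=0 trains CBoW and sg=1 trains Skip-gram. The corpus here is a toy example, so the resulting vectors are only illustrative.

```python
# Word2Vec sketch with gensim (CBoW vs. Skip-gram).
from gensim.models import Word2Vec

sentences = [["coin", "is", "money"], ["credit", "is", "not", "money"], ["money", "buys", "coins"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBoW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-gram

print(cbow.wv["money"][:5])                   # 50-dimensional word vector (first 5 values)
print(skipgram.wv.most_similar("money"))      # nearest words in the toy embedding space
```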
fastText
fastText is another word embedding method that extends the Word2Vec model. Instead of learning vectors for words directly, fastText represents each word as a bag of character n-grams.
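A short gensim-based fastText sketch follows; because words are built from character n-grams, the model can produce a vector even for a word it never saw during training.

```python
# fastText sketch with gensim: subword (character n-gram) based embeddings.
from gensim.models import FastText

sentences = [["coin", "is", "money"], ["credit", "is", "not", "money"]]
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

print(model.wv["money"][:5])
print(model.wv["moneyy"][:5])   # out-of-vocabulary word, still gets a subword-based vector
```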
Global Vectors (GloVe)
The GloVe approach uses the complete corpus to derive an embedding for a word. It is a word-level vector representation technique. GloVe captures the statistics of the entire corpus using word-word co-occurrence probabilities; the resulting embedding is called GloVe (Global Vectors) because it is built from global statistics.
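Pretrained GloVe vectors can be loaded via gensim's downloader, as sketched below; the model name is one of the standard gensim-data identifiers and is downloaded on first use.

```python
# GloVe sketch: loading pretrained global vectors through gensim-data.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")      # 50-dimensional GloVe vectors
print(glove["coin"][:5])
print(glove.most_similar("coin", topn=3))
```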
Doc2Vec (Distributed Memory & Distributed BoW)
A good approach for getting a document/paragraph/sentence embedding as a fixed-length vector is Doc2Vec, which is an extension of Word2Vec. Just like Word2Vec, it has two variants (a short sketch follows the list):
Distributed Memory (DM) Doc2Vec, analogous to the CBoW model
Distributed Bag of Words (DBoW) Doc2Vec, analogous to the skip-gram model
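The gensim-based sketch below trains both variants on a toy corpus; dm=1 gives Distributed Memory and dm=0 gives Distributed Bag of Words.

```python
# Doc2Vec sketch with gensim (DM vs. DBoW).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["coin", "is", "money"], tags=[0]),
        TaggedDocument(words=["credit", "is", "not", "money"], tags=[1])]

dm_model = Doc2Vec(docs, vector_size=50, min_count=1, dm=1)     # Distributed Memory
dbow_model = Doc2Vec(docs, vector_size=50, min_count=1, dm=0)   # Distributed Bag of Words

print(dm_model.infer_vector(["money", "and", "coins"])[:5])     # embedding for an unseen sentence
```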
Transformer based embedding
These are the latest types of word embeddings, based on the transformer architecture. The major advantage of these embeddings is that they capture the full context of the sentence being embedded.
Several transformer-based models are available for this and can generate the best contextual embedding based on the use case.
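One way to obtain such embeddings is the sentence-transformers library, sketched below; the model name is a commonly used general-purpose model and is downloaded on first use.

```python
# Transformer-based sentence embedding sketch using sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["A coin is a form of money", "Credit cards are not cash"])

print(embeddings.shape)                              # fixed-length vectors, one per sentence
print(util.cos_sim(embeddings[0], embeddings[1]))    # semantic similarity of the two sentences
```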
Etc…
5 – Describe the vector database and its use cases.
Ans: A vector database indexes and stores vector embeddings for quick retrieval and similarity search, and offers features like CRUD operations, metadata filtering, and horizontal scaling. Vector databases are used where traditional relational database management systems or semi-structured data stores are not enough to meet the actual requirement.
Vector databases are normally used in scenarios that involve handling large amounts of unstructured data. The unstructured data are usually encoded as embeddings, and those embeddings are stored in the vector database. When a user query arrives, the system matches the query embedding against the corpus embeddings and retrieves the most semantically similar results for the end user.
Besides textual embeddings, vector databases can handle any kind of embedding, such as visual, video, and concept embeddings. As a result, vector databases can be used for multi-domain semantic matching scenarios.
6 – What is Approximate Nearest Neighbor (ANN) search?
Ans: An approximate nearest neighbor search algorithm may return points that are at most c times farther away from the query than its nearest points.
An approximate nearest neighbor is frequently almost as good as an exact one, which is one of the appeals of this method. In particular, minor variations in distance shouldn't matter, provided the distance metric effectively captures the notion of user-perceived quality.
ANN uses methods like locality-sensitive hashing to better balance speed and accuracy, acting as a faster search with a small accuracy trade-off. It is especially useful for higher-dimensional datasets, where exact algorithms like KNN fail to return relevant results within a reasonable time.
Five different types of ANN-based algorithms are used in semantic search. Those are:
Brute Force
Although it isn't strictly an ANN algorithm, it offers the most straightforward solution and a baseline against which to compare all other methods. It computes the distance between the query and every point in the dataset and then sorts to determine which point is closest. It is extremely inefficient.
Hashing Based(LSH)
This involves a preprocessing stage in which the data is hashed into several hash tables to prepare for querying. After receiving a query, the algorithm loops through the hash tables, collecting all similarly hashed points and assessing their closeness to produce a list of the nearest neighbors.
Graph-Based
This begins with a set of "seeds" and builds one or more graphs, then uses best-first search to traverse them. The implementation can identify the "true" nearest neighbor by keeping track of the visited vertices from each neighbor.
Partition Based
The implementation divides the dataset into increasingly distinguishable subsets until it settles on the closest neighbor.
Hybrid
This is the combination of some of the above methods.
7 – What are the ANN benchmarks?
Ans: Fast nearest-neighbor search in high-dimensional spaces is an increasingly important challenge in semantic search. ANN-Benchmarks is a benchmarking environment for approximate nearest neighbor search algorithms.
The idea is to compare/benchmark ANN algorithms to better understand how implementation choices can vary based on the kind and size of the dataset.
Several precomputed datasets are available for this. All datasets are divided into train and test sets and include ground-truth information for the top 100 neighbors. They are stored in HDF5 format. The following table contains information about the different datasets used in the ANN benchmarks.
Different approximate nearest neighbor-based search techniques are evaluated as part of the ANN benchmarking. Some of the evaluated techniques are VESPA, ANNOY, FAISS, K-GRAPH, etc.
Here is the full list
For more details, please check the official GitHub page here
8 – Compare WEAVIATE, ANNOY, and FAISS
Ans:
WEAVIATE
Weaviate is a vector search engine and vector database. It uses machine learning to vectorize and store data and to find answers to natural language queries. It is a low-latency vector search engine with out-of-the-box support for different unstructured media types, offering semantic search, question-answer extraction, classification, customizable models, and more. In Weaviate, vectorized search combined with multilayered graph (HNSW) storage enhances semantic-level data retrieval. Transformer-based modules can be used for vectorization and GraphQL for querying.
ANNOY
Annoy is an open-source library used to locate approximate nearest neighbors: it searches for points in space that are close to a given query point. To allow many processes to share the same data, it builds large read-only file-based data structures that are memory-mapped. For our purposes, it is used to compare words or texts in a vector space. The library is developed by Spotify.
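A minimal Annoy sketch is shown below: build a read-only index of vectors, then query it for approximate nearest neighbors.

```python
# Annoy sketch: index random vectors and query the approximate nearest neighbors.
from annoy import AnnoyIndex
import random

dim = 40
index = AnnoyIndex(dim, "angular")           # "angular" behaves like cosine distance
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(10)                              # more trees -> higher accuracy, larger index

query = [random.gauss(0, 1) for _ in range(dim)]
print(index.get_nns_by_vector(query, 5))     # indices of the 5 approximate nearest neighbors
```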
FAISS
Faiss is a library for similarity search and clustering of dense vectors. It contains algorithms that can search collections of vectors of any size, including those that do not fit in RAM. It also has supporting code for parameter tuning and evaluation. Faiss is written in C++ with complete Python/NumPy wrappers, and some of the most useful algorithms are implemented on the GPU. It is developed at Facebook AI Research.
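Below is a minimal Faiss sketch using the exact IndexFlatL2 baseline; approximate indexes such as IVF or HNSW follow the same add/search pattern. Faiss expects float32 NumPy arrays.

```python
# Faiss sketch: brute-force L2 index as a baseline for nearest-neighbor search.
import numpy as np
import faiss

dim = 64
corpus = np.random.random((1000, dim)).astype("float32")
queries = np.random.random((3, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)    # exact baseline; faiss.IndexHNSWFlat(dim, 32) would be an ANN alternative
index.add(corpus)
distances, ids = index.search(queries, 5)
print(ids)                        # indices of the 5 nearest corpus vectors per query
```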
9 – Do you have practical coding experience with semantic text search tools?
Ans – Yes.
The tools I have used for developing semantic matching in the unstructured data domain are listed below, along with their official website links.
10 – What is Locality-Sensitive Hashing (LSH)?
Ans: Locality-sensitive hashing (LSH) is a technique used in computer science that hashes similar inputs into the same "buckets" with high probability (the number of buckets is much smaller than the universe of possible input items). LSH is one of the algorithms used for approximate nearest neighbor search.
The term "LSH" refers to a family of functions that hash similar inputs into the same "buckets" with high probability and dissimilar inputs into different ones. This makes it easier to identify observations that are similar in various ways.
Steps
The LSH algorithm has three major modules
Shingling
This stage involves converting each document into a set of substrings of length k (k-shingles or k-grams). The main idea is to represent each document in our application as a set of k-shingles.
Now, we need a metric to check how similar different documents are. The Jaccard Index is a common choice. The Jaccard Index of documents A and B is defined as J(A, B) = |A ∩ B| / |A ∪ B|. A short shingling sketch is given below.
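```python
# Shingling sketch: represent each document as a set of k-shingles and compare with the Jaccard Index.
def shingles(text: str, k: int = 5) -> set:
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

doc_a = "the coin is a form of money"
doc_b = "a coin is one form of money"
print(jaccard(shingles(doc_a), shingles(doc_b)))
```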
Min hashing
Hashing means applying a hash function H to convert each document into a small signature. The choice of hash function is closely related to the similarity metric we employ; the appropriate hashing scheme for Jaccard similarity is min-hashing.
Locality-sensitive hashing(LSH)
The aim of LSH is, given the signatures of two documents, to determine whether the two documents form a candidate pair, i.e., whether their similarity exceeds a threshold t (a minimal end-to-end sketch using the third-party datasketch library follows).
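The sketch below strings the three modules together (shingling, MinHash signatures, LSH candidate lookup) using the third-party datasketch library; threshold=0.5 means pairs with estimated Jaccard similarity above 0.5 are returned as candidates.

```python
# End-to-end LSH sketch with datasketch: shingling -> MinHash signatures -> candidate pairs.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, k: int = 5, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - k + 1):
        m.update(text[i:i + k].encode("utf8"))   # hash each k-shingle into the signature
    return m

docs = {"d1": "the coin is a form of money",
        "d2": "a coin is one form of money",
        "d3": "credit cards are not related"}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
for name, text in docs.items():
    lsh.insert(name, minhash(text))

print(lsh.query(minhash("the coin is a form of money")))   # candidate documents for this query
```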
Conclusion
Semantic search over unstructured data is one of the hottest research domains today, because all of us prefer a search engine or AI application that can read our actual intent while we interact with it, rather than one that just follows literal commands. Semantic search does exactly that, and it operates across domains, including text, vision, speech, etc. Using this technique, we can extract semantically similar data from large corpora. In this blog, I have covered some common interview questions that you may be asked in a data mining interview.
Key Takeaways
In contrast to lexical search, which focuses on finding exact matches between the query words or their variants without considering the query’s overall meaning, semantic search refers to search with meaning.
The term “embedding” refers to how words are represented for text analysis, often as a real-valued vector that encodes the word’s meaning.
Vector databases were specifically designed to manage the distinct structure of vector embeddings. They index vectors for quick search and retrieval by comparing values and selecting those most similar to one another.
Locality-sensitive hashing is an algorithmic technique used in computer science that, with a high degree of probability, hashes similar input items into the same “buckets.”
It is permitted for an approximate nearest neighbor search algorithm to return points that are at most c times farther away from the query than its nearest points.
I hope this article helped you to strengthen your semantic search fundamental concepts. Feel free to leave a remark below if you have any questions, concerns, or recommendations.
Keep learning..😊 !