Basic Concept and Backend of AWS Elasticsearch

Trupti Dekate Last Updated : 14 Jun, 2023


This article was published as a part of the Data Science Blogathon.

Introduction

Elasticsearch is a search platform built for fast search. It is a Lucene-based search engine written in Java, with clients available in many languages, including Python, C#, Ruby, and PHP. It takes unstructured data from multiple sources and stores it in a structured format optimized for language-based (full-text) search.

Source: aws.amazon.com

As mentioned above, Elasticsearch focuses on search capabilities and features. It is useful for searching multiple data types. It has a distributed architecture that enables near-real-time search and analysis of large volumes of data.

The ability to scale from one machine to hundreds of machines sets it apart from many other tools. A fully featured search cluster is easy to start, although running it well requires a high degree of expertise. In addition to search-oriented uses, Elasticsearch is also useful for storing data that needs to be grouped by multiple dimensions. Metrics, logs, traces, and other time-series data are some examples of its analytical use.

AWS Elasticsearch

Amazon Elasticsearch Service (often called AWS Elasticsearch) is now called Amazon OpenSearch Service. Amazon OpenSearch Service supports both OpenSearch and legacy Elasticsearch OSS. When creating a cluster, users choose which search engine to run. OpenSearch is broadly compatible with Elasticsearch OSS 7.10, the final version of Elasticsearch released under an open-source license. OpenSearch is an open-source search engine that also offers analytics features for real-time log analysis and application monitoring.

The Basic Concepts Behind Elasticsearch

Working with Elasticsearch requires a few key concepts. Below is a glossary of the main components.

1. Documents: A document is the basic unit of information in Elasticsearch. It is expressed in JSON, a widely used format for exchanging data on the Internet. A document can be compared to a row in a relational database: it represents a single entity.

However, documents are not limited to plain text; they can hold any structured data encoded in JSON. Each document has a unique ID and a type. These details identify the document and the kind of entity it represents.

Source: aws.amazon.com

2. Indexes: Multiple documents with similar properties form an index. It is also the top-level entity against which queries are run in Elasticsearch. The documents in an index are logically related. An index has a name that identifies it during indexing and other operations.

3. Inverted Index: This is the data structure on which the search engine works. It stores a mapping from content (terms) to the places in documents where that content occurs. Note that strings are not stored whole; instead, each document is split into individual search terms.

Each of these search terms is then mapped to the documents in which it occurs. This enables fast full-text searches even for large volumes of data.
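The three concepts above can be sketched in a few lines of Python. This is a simplified illustration, not how Lucene stores its index: the "articles" documents, the field names, and the `tokenize` and `build_inverted_index` helpers are all hypothetical.

```python
import json
import re
from collections import defaultdict

# Hypothetical JSON documents in a hypothetical "articles" index, keyed by ID.
articles = {
    "1": json.loads('{"title": "Search basics", "body": "Fast full-text search"}'),
    "2": json.loads('{"title": "Log analysis", "body": "Search logs in near real time"}'),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents, field):
    """Map each term in `field` to the sorted IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in tokenize(doc[field]):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


body_index = build_inverted_index(articles, "body")
print(body_index["search"])  # ['1', '2']
print(body_index["logs"])    # ['2']
```

Looking up a term is a single dictionary access, which is why the structure stays fast as the number of documents grows.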

AWS Elasticsearch – Backend Concepts

Several Elasticsearch components are hidden or can be labeled as backend components.

They are listed below:

Source: aws.amazon.com

Cluster: A cluster is a group of connected nodes. Elasticsearch distributes tasks such as searching and indexing across all nodes in the cluster.
Node: A node is a single server in a cluster. Nodes store the data and take part in the cluster's indexing and search operations. Nodes can be configured for different roles:
- Master node: This node acts as the control room of the Elasticsearch cluster because it controls cluster-wide operations, such as creating or deleting an index or adding or removing nodes.
- Data node: This node stores data and performs data-related operations such as search and aggregation.
- Client node: This node forwards requests to the appropriate nodes. For example, it sends cluster-level requests to the master node and data requests to the data nodes.
Shards: An index is divided into several parts called “shards.” Each shard is a fully functional, independent index that can be hosted on any node in the cluster. The documents in an index are distributed across these shards, and the shards are spread across different nodes. Combined with replicas, this protects against hardware failure and data loss. It also increases query capacity.
Replicas: Replicas are copies of primary shards. Each document in an index belongs to exactly one primary shard. Replicas keep copies of the data available if hardware fails, and they also increase the capacity to serve requests.
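As a rough sketch of how documents might be spread over shards and how replicas might be kept on separate nodes, consider the Python below. It assumes a simple hash-modulo routing rule and round-robin placement; Elasticsearch's actual hash function and allocation logic differ, and the node names and helper functions are hypothetical.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a primary shard from a deterministic hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(nodes, num_primary_shards, num_replicas):
    """Assign each primary and its replicas to distinct nodes, round-robin."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("need at least one node per shard copy")
    layout = {}
    for shard in range(num_primary_shards):
        copies = [nodes[(shard + i) % len(nodes)] for i in range(num_replicas + 1)]
        layout[shard] = {"primary": copies[0], "replicas": copies[1:]}
    return layout


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
layout = place_shards(nodes, num_primary_shards=3, num_replicas=1)
shard = shard_for("order-42", 3)
print(shard, layout[shard])
```

Because a replica never shares a node with its primary in this layout, losing any single node still leaves one copy of every shard.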

Abilities

Let’s understand the main capabilities of Elasticsearch:

Search Engine: Elasticsearch’s unique selling point is easy full-text search. Traditional SQL database management systems were not designed for full-text search over large volumes of data.
Analytics Engine: Elasticsearch also owes much of its popularity to analytics. It is widely used for log analysis and for slicing numerical data such as performance metrics. It also supports data aggregation (Elasticsearch aggregation queries), which helps with data visualization.
Scalable architectural design: Thanks to its distributed architecture, Elasticsearch can scale across multiple servers and store petabytes of data. Distributed systems are often complex, but Elasticsearch hides much of that complexity, and scaling it is easier than with most other systems. Elasticsearch also automatically replicates data, helping to prevent data loss when a node fails.
The right investment choice: The Elasticsearch mechanism is easy to understand, especially with small data sets. It has a common API that integrates well with other tools, such as Logstash for sending data to Elasticsearch and Kibana for data visualization. These capabilities and a short learning curve make it easy to get started with Elasticsearch and increase productivity.
Well-documented API: This is another factor behind its growing popularity. Developers can take advantage of its integration APIs. In addition, Elasticsearch provides client libraries for many programming languages, such as Java, JavaScript, and PHP, which makes integration easy for developers.
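To make the aggregation capability more concrete, the sketch below shows a request body for a terms aggregation with an average sub-aggregation, followed by a plain-Python function that computes the same result locally. The log documents and the field names `service` and `latency_ms` are hypothetical, and the request assumes `service` is mapped as a keyword field.

```python
import json
from collections import defaultdict

# Hypothetical log documents; the field names are illustrative only.
logs = [
    {"service": "checkout", "latency_ms": 120},
    {"service": "checkout", "latency_ms": 80},
    {"service": "search", "latency_ms": 30},
]

# Request body in the query DSL: group by service, then average the latency.
request_body = {
    "size": 0,
    "aggs": {
        "by_service": {
            "terms": {"field": "service"},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}


def average_by(documents, group_field, value_field):
    """Compute locally what the aggregation above asks the cluster to compute."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[group_field]].append(doc[value_field])
    return {key: sum(values) / len(values) for key, values in buckets.items()}


print(json.dumps(request_body))
print(average_by(logs, "service", "latency_ms"))  # {'checkout': 100.0, 'search': 30.0}
```

In a real cluster the grouping and averaging run on the data nodes, so only the small bucket summary travels back to the client.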

Working of AWS Elasticsearch

The primary purpose of Elasticsearch is to receive and manage semi-structured data. Its primary data structure is an inverted index, which is managed through the Apache Lucene APIs.

You may be wondering what an “inverted index” is. Read on for the answer.

Source: aws.amazon.com

An inverted index maps each unique token (word) to the list of documents that contain it. This makes finding the documents for a given keyword fast. Index data is stored in several partitions called “shards.” Elasticsearch can not only dynamically distribute and allocate shards to nodes in a cluster but also replicate them. This makes data distribution flexible.

Distributing copies of primary shards (replicas) to different cluster nodes provides redundancy. Indexing operations go to the primary shards first, while both primary and replica shards are used when running search queries. Adding nodes and replicas therefore improves query performance.
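The write and read paths described above can be modeled with a small Python sketch. It is a deliberately simplified, synchronous model under stated assumptions: the `ShardCopy` and `ShardGroup` classes and the node names are hypothetical, and real replication involves acknowledgements and failure handling that are omitted here.

```python
import itertools


class ShardCopy:
    """One copy of a shard, holding documents by ID on a named node."""

    def __init__(self, node):
        self.node = node
        self.docs = {}


class ShardGroup:
    """A primary shard plus its replicas (a simplified, synchronous model)."""

    def __init__(self, primary_node, replica_nodes):
        self.primary = ShardCopy(primary_node)
        self.replicas = [ShardCopy(node) for node in replica_nodes]
        # Searches rotate over every copy, primary included.
        self._readers = itertools.cycle([self.primary] + self.replicas)

    def index(self, doc_id, doc):
        """Apply the write to the primary first, then copy it to each replica."""
        self.primary.docs[doc_id] = doc
        for replica in self.replicas:
            replica.docs[doc_id] = dict(doc)

    def search(self, predicate):
        """Answer from the next copy in rotation; return (node, matching IDs)."""
        copy = next(self._readers)
        hits = sorted(i for i, d in copy.docs.items() if predicate(d))
        return copy.node, hits


group = ShardGroup("node-a", ["node-b"])
group.index("1", {"level": "error"})
group.index("2", {"level": "info"})
print(group.search(lambda d: d["level"] == "error"))  # ('node-a', ['1'])
print(group.search(lambda d: d["level"] == "error"))  # ('node-b', ['1'])
```

The two searches return the same hits from different nodes, which is the sense in which extra replicas add query capacity.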

Use Cases

There are some basic use cases for Elasticsearch:

Search Applications: Elasticsearch suits applications and websites that depend on a search platform to access, retrieve, and report data.
Website Search: Elasticsearch helps deliver accurate and fast search results for websites that store huge amounts of data, and it is widely used for web search.
Enterprise Search: Elasticsearch also enables enterprise-wide search, such as document search and e-commerce product search. It has become a trusted search solution for many websites.
Log Analytics: Elasticsearch is a common tool for analyzing log data in near real time. Its scalability and the operational insight it provides make it a popular choice.
Security Analysis: Security analysis is another domain in which Elasticsearch plays an important role. Access logs and similar security-related logs can be analyzed using the ELK stack (Elasticsearch, Logstash, and Kibana).
Business Analytics: Many built-in features of the ELK stack also make it a popular business analytics tool. However, gaining in-depth know-how about implementing these tools may take longer.

Advantages

Here are some of the benefits:

High performance standards: Elasticsearch can process huge volumes of data in parallel, providing fast search results.
Application Development: It supports multiple programming languages such as Java, Python, and PHP, making it a popular choice for application developers.
Fast operation speed: Elasticsearch read and write operations are fast, which enables near-real-time use cases such as application monitoring.
Fast time to value: Elasticsearch provides simple REST-based APIs and uses schema-free JSON documents. This makes it easy to build applications quickly for many use cases.
Additional tools: Kibana is a visualization and reporting tool integrated with Elasticsearch. Elasticsearch also integrates with Beats and Logstash, which transform source data and load it into clusters. Many plugins are also available to extend its functionality.
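As an illustration of the REST-based, JSON-document style mentioned above, the sketch below builds (without sending) two HTTP requests using only the Python standard library. The base URL, the index name `articles`, and the helper functions are assumptions for the example; a managed domain on AWS would normally also require authentication, which is not shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured test cluster


def index_request(index, doc_id, document):
    """Build (but do not send) a request that stores one JSON document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build (but do not send) a full-text match query against one field."""
    query = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = index_request("articles", "1", {"title": "Search basics"})
print(req.get_method(), req.full_url)
# To actually send it against a running cluster: urllib.request.urlopen(req)
```

No schema is declared before the document is stored, which is what "schema-free" means in practice: field mappings are inferred from the JSON unless they are defined explicitly.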

Frequently Asked Questions

Q1. What is Elasticsearch in AWS?

A. Elasticsearch in AWS is a fully managed service provided by Amazon Web Services (AWS) that allows users to deploy and run Elasticsearch clusters in the cloud. Elasticsearch is an open-source search and analytics engine built on top of Apache Lucene, designed for storing, searching, and analyzing large volumes of data in near real-time. AWS Elasticsearch service simplifies the deployment, scaling, and management of Elasticsearch clusters, eliminating the need for manual setup and configuration. It offers features such as automated backups, high availability, security controls, and integration with other AWS services, making it a convenient choice for implementing search and analytics solutions in the cloud.

Q2. What are types in Elasticsearch?

A. In Elasticsearch, types (mapping types) were logical categories or labels assigned to documents within an index. Before version 6.0, multiple types could exist within an index, allowing for further categorization and organization of documents. Indexes created in 6.x can contain only a single type, version 7.0 deprecated types in the APIs in favor of the single default type “_doc”, and version 8.0 removed them.

Conclusion

Elasticsearch owes much of its popularity to search and analytics. It is widely used for log analysis and for slicing numerical data such as performance metrics, and its aggregation queries support data visualization. Thanks to its distributed architecture, it can scale across multiple servers and store petabytes of data, without much of the complexity usually associated with distributed systems.

Elasticsearch focuses on search capabilities and features. It is useful for searching multiple data types, and its distributed architecture enables near-real-time search and analysis of large volumes of data.
Many management decisions, such as shard allocation, are made automatically. Scaling is much easier than in most other systems, and Elasticsearch automatically replicates data to help prevent data loss when a node fails.
Amazon Elasticsearch Service (AWS Elasticsearch) is now called Amazon OpenSearch Service. It supports both OpenSearch and legacy Elasticsearch OSS. OpenSearch is an open-source search engine that offers analytics features for real-time log analysis and application monitoring.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Trupti Dekate

I am an Accountant at Global private Analytics Services, working with the Data Analysis Team to handle the budgets of various growing companies. We provide analytics services and make the work of new tech companies easier by helping them manage their total investment and offering suggestions.
