How is AWS Athena different from other databases?

Gitesh Dhore Last Updated : 25 Jul, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

Amazon Athena is an interactive query service, based on the open-source Presto engine, that lets you analyze data stored in Amazon S3 directly using ANSI SQL. It is also serverless, so there is no infrastructure to manage or maintain, and you pay only for the queries you run.

To get started with Athena, you define the schema of your data stored in Amazon S3; once that is done, you are ready to query it with SQL. The schema is defined in the AWS Glue Data Catalog, which lets you create a unified metadata repository across multiple services.
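As a rough illustration of what "defining the schema" can look like, the sketch below builds a CREATE EXTERNAL TABLE statement of the kind Athena accepts for data that already sits in S3. The database, table, column names, and bucket path are hypothetical placeholders, and the helper function is only an illustration, not part of any AWS library.

```python
def build_create_table_ddl(database, table, columns, partition_columns, s3_location):
    """Return a DDL string describing Parquet files that already exist in S3.

    Nothing is loaded or copied: the statement only registers a schema
    that points at the given S3 location.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


# All names below are made up for the example.
ddl = build_create_table_ddl(
    database="analytics",
    table="page_views",
    columns=[("user_id", "string"), ("url", "string"), ("duration_ms", "bigint")],
    partition_columns=[("dt", "string")],
    s3_location="s3://example-bucket/page_views/",
)
print(ddl)
```

Running the generated statement in Athena would register the table in the catalog; the files in S3 stay where they are.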

Athena can be used alongside or instead of traditional databases, depending on the specific business and technical scenario. First, though, it is essential to understand the differences and why you would choose one over the other.

Differentiating Athena from Database and Warehouse

Athena works more like a query engine than a traditional database. This means that:

Compute and storage are decoupled: Databases both store data at rest and provide the resources needed to perform queries and calculations, and each of these comes with direct and indirect overhead costs. Athena does not store data; instead, storage is managed entirely on Amazon S3. The Athena query service is fully managed, so AWS automatically allocates resources as needed to execute a query.
No DML interface: There is no need to model data in Athena. I/O is the bottleneck of virtually every database, but because Athena spends no I/O bandwidth on data modelling, it can devote all of its computing resources to query processing.

Advantages of Using Athena

Serverless Design Reduces IT Overhead: Amazon Athena is serverless, meaning there is no user-side infrastructure to manage or configure. Using Athena is as simple as defining a query, and you only pay for the queries you run. As a result, there are no additional IT costs and no clusters to manage.

Based on SQL: You can use Athena to run SQL queries against tables configured in the AWS Glue Data Catalog, or against data sources that you connect to using the Athena Query Federation SDK. For users who already know SQL, there is no learning curve to get started.
Open architecture (no vendor lock-in): Athena enables open access to data rather than lock-in to a specific tool or technology. This manifests itself in several ways:
Ubiquitous Access: Because your data is stored in an S3 bucket and the schema is defined in the Glue Data Catalog, you can switch between query engines that can read from these sources without redefining the schema or creating a separate copy of the data.
Separated storage and computing resources: Athena completely separates compute from storage. Data is stored in your Amazon S3 account, while Amazon Web Services provides Athena compute as a resource shared among all Athena users.
Open file formats: Unlike many high-performance databases, Athena does not use a proprietary file format but supports standard open source formats such as Apache Parquet, ORC, CSV, and JSON.
Low cost: Athena’s pricing model is based on terabytes of data scanned. You can keep costs down by scanning only the data you need to answer a specific query (this can be done by partitioning the data; see below).
Access to all your data: Most organizations load only 30 to 35 percent of their data into a traditional data warehouse because of the high operational and infrastructure costs of constantly resizing database clusters. Because Amazon S3 storage costs a fraction of what you would pay to keep the same data in a data warehouse, you can handle larger volumes of data without worry.
Custom Connectors: Amazon Athena lets you run SQL queries across multiple data sources, which can drive various business intelligence and analytics processes. You can use JDBC to connect Athena with BI and machine learning tools.
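To make the pay-per-scan pricing model concrete, here is a small back-of-the-envelope sketch. The price of 5 dollars per terabyte and the use of decimal terabytes are assumptions made for illustration only; check the current AWS pricing page for your region before relying on any figure.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def estimate_query_cost(bytes_scanned, price_per_tb=5.0):
    """Estimate the cost of one query from the amount of data it scans.

    price_per_tb is an illustrative placeholder, not an official rate.
    """
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Scanning a whole 2 TB dataset versus only one day out of a year of data.
full_scan = estimate_query_cost(2 * BYTES_PER_TB)
one_day = estimate_query_cost(2 * BYTES_PER_TB / 365)

print(f"full scan: ${full_scan:.2f}")
print(f"one day:   ${one_day:.4f}")
```

The gap between the two numbers is the reason scanning less data, for example through partitioning or columnar formats, translates directly into lower cost.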

Image source – https://www.sqlshack.com/an-introduction-to-aws-athena/

Limitations of Athena

No built-in insert/update/delete operations: Because Athena is a query engine with no DML interface, upserts can be difficult.

Optimization is limited to queries: You can optimize your queries, but not your data. Because your data is already stored in Amazon S3, transforming it to suit Athena may affect other users who rely on the same data for other purposes.
Multi-tenancy means pooled resources: All Athena users receive a similar SLA for queries at any time. In other words, the entire global user base is “competing” for the same resources – and although AWS provides more as needed, this could mean that query performance fluctuates depending on other people’s usage.
No indexing: Indexes are integrated into traditional databases but do not exist in Athena. This makes joining large tables a demanding operation that increases the load on Athena and negatively impacts performance. For example, running a query by key requires scanning all the data and searching for the desired key in the result list. This is solved using Upsolver lookup tables.
Partitioning: Efficient queries in Athena require the data to be partitioned. It is essential to keep the number of partitions in a range that meets your performance needs, because every 500 partitions scanned will add 1 second to your query.
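The partitioning trade-off can be sketched with a little arithmetic. The example below lays data out under Hive-style key=value prefixes, one common convention for partitioned data in S3, and applies the rule of thumb quoted above (1 second per 500 partitions scanned). The bucket name and the year/month/day layout are hypothetical.

```python
from datetime import date, timedelta


def partition_prefix(base, day):
    """Build a Hive-style S3 prefix such as .../year=2022/month=07/day=01/."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def partitions_for_range(base, start, end):
    """List the daily partition prefixes a date-range query would need to read."""
    days = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days)]


def partition_overhead_seconds(partitions_scanned, partitions_per_second=500):
    """Apply the article's rule of thumb: about 1 second per 500 partitions."""
    return partitions_scanned / partitions_per_second


week = partitions_for_range("s3://example-bucket/events", date(2022, 7, 1), date(2022, 7, 7))
print(week[0])
print(len(week), "partitions for one week")
print(partition_overhead_seconds(1500), "seconds of overhead for 1500 partitions")
```

Daily partitions keep a one-week query down to seven prefixes, while much finer partitioning (for example, per minute) would multiply the partition count and the overhead along with it.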

Other Products Required with Athena

Athena is never a standalone product but rather always part of a package that includes:

Amazon S3: Athena queries run directly on Amazon S3, so this is where your data will be stored.
Glue Data Catalog: A centralized managed schema that allows you to replace or augment Athena with other services as needed (for example, with Amazon Redshift Spectrum).
ETL Tools: While Athena can run almost any query out of the box, reducing costs and improving performance requires following a set of performance tuning best practices. The traditional way is to use Spark, which can process large volumes of unstructured data; however, this option requires considerable coding knowledge. Some solutions offer managed Spark as a service that simplifies the infrastructure aspects but doesn’t remove the coding overhead.
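As a lighter-weight alternative to the Spark route described above (this is a swapped-in technique, not something the list itself covers), Athena's CREATE TABLE AS SELECT (CTAS) statement can rewrite a table as partitioned Parquet. The sketch below only builds the SQL text; the table names, columns, and S3 location are hypothetical.

```python
def build_ctas(source_table, target_table, data_columns, partition_columns, external_location):
    """Build a CTAS statement that rewrites a table as partitioned Parquet.

    Partition columns are placed last in the SELECT list, which is what
    Athena expects for the columns named in partitioned_by.
    """
    select_list = ", ".join(list(data_columns) + list(partition_columns))
    partition_array = ", ".join(f"'{col}'" for col in partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partition_array}]\n"
        f") AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical names: a raw CSV-backed table rewritten as Parquet by date.
ctas = build_ctas(
    source_table="analytics.page_views_csv",
    target_table="analytics.page_views_parquet",
    data_columns=["user_id", "url", "duration_ms"],
    partition_columns=["dt"],
    external_location="s3://example-bucket/page_views_parquet/",
)
print(ctas)
```

This approach suits one-off or periodic conversions; continuous pipelines with complex transformations may still call for a dedicated ETL tool.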

Use Case

Athena helps analyze unstructured, semi-structured, and structured data stored in Amazon S3. Data can be stored in CSV, JSON or columnar formats such as Apache Parquet and Apache ORC. It can also be used to run queries using ANSI SQL, and this does not require the user to aggregate or load data into Athena.
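Because no data has to be loaded first, running a query programmatically comes down to submitting SQL and waiting for it to finish. The sketch below follows the shape of the boto3 Athena client calls start_query_execution and get_query_execution, but it takes the client as a parameter and uses a small fake client so that it runs offline. The fake class, database name, and S3 output location are all hypothetical.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0, max_polls=60):
    """Submit a query and poll until it reaches a terminal state."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")


class FakeAthenaClient:
    """Offline stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self._states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
        self.last_request = None

    def start_query_execution(self, **kwargs):
        self.last_request = kwargs
        return {"QueryExecutionId": "example-id"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}


client = FakeAthenaClient()
result = run_query(
    client,
    "SELECT url, COUNT(*) AS views FROM page_views GROUP BY url",
    database="analytics",
    output_location="s3://example-bucket/athena-results/",
    poll_seconds=0,
)
print(result)
```

With real credentials, the fake client would be replaced by boto3.client("athena"); the query results are written as files under the S3 output location.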

Athena can be integrated with Amazon QuickSight for data visualization, which helps generate reports, and data can also be explored using business intelligence tools or SQL clients that connect through a JDBC or ODBC driver.

Athena can also be integrated with the AWS Glue Data Catalog, which provides persistent metadata storage for user data in Amazon S3. This way, tables can be created and data can be queried in Athena, all based on a centralized metadata repository available throughout the user's account. Athena can also be integrated with the ETL (Extract, Transform, Load) and data discovery features included in AWS Glue.

Conclusion

Athena is an interactive query service offered by Amazon. It makes it easy to analyze data directly in Amazon S3 (Simple Storage Service) using standard SQL. For example, in the AWS Management Console, Athena can be pointed at the location of the data in Amazon S3 with a few clicks. SQL can then be used to run ad-hoc queries, with results returned to the user in seconds.

Athena does not store data. Instead, storage is managed entirely on Amazon S3. The Athena query service is fully managed, so AWS automatically allocates resources as needed to execute a query.
Because your data is stored in an S3 bucket and the schema is defined in the Glue Data Catalog, you can switch between query engines that can read from these sources without redefining the schema or creating a separate copy of the data.
Indexes are integrated into traditional databases but do not exist in Athena. This makes joining large tables a demanding operation that increases the load on Athena and negatively impacts performance. For example, running a query by key requires scanning all the data and searching for the desired key in the result list. This is solved using Upsolver lookup tables.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
