Ever wondered how to query and analyze raw data? Have you ever tried doing this with Athena and QuickSight? This blog post walks you through the steps using AWS services and tools: Amazon's combination of storage, querying, and visualization services can help you reach your data analysis and visualization goals.
The required AWS services and tools are as follows:
Amazon Simple Storage Service (S3) is an object storage service that lets you store and retrieve any amount of data from anywhere on the web.
Amazon Athena is an interactive query service that lets us analyze and query raw data stored in S3 using standard SQL.
QuickSight is an AWS-based Business Intelligence and visualization tool used to visualize data, perform ad hoc analysis, and derive business insights from our data.
Let’s start by checking the flow of data from left to right. We will first upload our source file to S3, then connect Athena to S3 to query the data, and finally, use QuickSight to visualize the information.
Upload data to S3
Create a new bucket.
Create a new folder inside this bucket, because Athena reads data from a folder (prefix) rather than from an individual file.
Upload the source file to the previously created folder.
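If you prefer to script these steps, here is a minimal sketch using boto3, the AWS SDK for Python. The bucket name, folder, and file name are placeholders, not values from this walkthrough.

import boto3

# Hypothetical bucket, folder, and file names -- replace with your own.
# In us-east-1 no CreateBucketConfiguration is needed; other regions require one.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-sales-data-bucket")

# The "sales/" prefix acts as the folder that Athena will later point to.
s3.upload_file("sales.csv", "my-sales-data-bucket", "sales/sales.csv")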
Create a Table in Athena and Query the Data
There are three ways to access Athena: the AWS Management Console, the Amazon Athena API, or the AWS CLI. For this blog, we will be using the AWS Management Console. Before working with Athena, ensure the Athena region is the same as the S3 bucket region created earlier. Otherwise, you won’t be able to connect and query the data.
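The same region requirement applies when you work with Athena programmatically. A small boto3 sketch, assuming the bucket created earlier lives in us-east-1:

import boto3

REGION = "us-east-1"  # assumption: the region where the bucket was created

# Verify that the bucket and the Athena client use the same region.
s3 = boto3.client("s3", region_name=REGION)
location = s3.get_bucket_location(Bucket="my-sales-data-bucket")
print(location["LocationConstraint"])  # None means us-east-1

athena = boto3.client("athena", region_name=REGION)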
AWS Athena has a simple and easy-to-understand interface.
(Image source: https://aws.amazon.com)
In the left panel, you can see the databases, along with the tables and views that belong to the selected database. The right section is for writing SQL queries, and the output of the query we ran is displayed in the Results section. In addition, Athena allows us to save or format our queries.
The menu structure is easy to navigate and includes five primary tabs: Query Editor, Saved Queries, History, AWS Glue Data Catalog, and Workgroup: Primary.
So let’s start working with Athena. First, we need to create a table, and there are several options to do this. We will create a table from the S3 bucket. Once we select this option, we will be redirected to the four-step process of creating the table.
Step 1: Name and Location
This step defines the database, table name, and input data set location.
You can select a database from the drop-down list if you already have one. If not, choose Create a new database and give it a name. Next, we need to enter the name of the table; this name must be unique within the database. After naming the table, we define the location of the source file, and this is where S3 comes into play. We specify the path to the folder, i.e. the bucket name followed by the folder name. Note that we cannot point Athena at a single file; we can only provide a folder path.
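For reference, the location is entered as an S3 URI that ends with the folder; using the placeholder names from earlier, it would look like this:

s3://my-sales-data-bucket/sales/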
Step 2: Format the data
There are several data formats that Athena can read. In this example, our source is a CSV file, so we select CSV as the data format.
Step 3: Columns
The source file does not contain the column names, so we need to specify them in this step; Athena needs them to know the schema of the data we are working with. This can be done in two ways: either enter the column name and column type for each column individually, or add the columns in bulk, as shown in the image below.
The second method is more convenient when the dataset has many columns and configuring each one individually would be tedious. The format is simple: the column name, followed by a space, followed by that column's data type, with a comma separating column definitions, as in the example below.
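For instance, for a sales dataset like the one queried later in this post, the bulk definition might look like this (the exact column list here is hypothetical):

region string, productCategory string, productName string, sales double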
Step 4: Partitions
In this step, we can configure partitions. Amazon recommends partitioning the data to reduce the amount of data each query has to scan, which can improve performance and reduce query costs. Partitioning pays off on large datasets, but our example dataset is small, so it is not required here.
With this last step, we have completed the table creation process. We created it using the Create Table Wizard, but that’s not the only way to do it. Athena also allows us to create tables using DDL statements, as shown in the figure below.
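To give a sense of what such a DDL statement looks like, here is a sketch that submits one through boto3; the schema, database name (salesdb, assumed to exist already), bucket path, and output location are assumptions based on the example in this post, not the exact statement from the figure.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical schema and S3 paths -- adjust to match your own data.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS salesdb.sales (
  region          string,
  productCategory string,
  productName     string,
  sales           double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-sales-data-bucket/sales/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-sales-data-bucket/athena-results/"},
)

If you wanted the partitioning described in Step 4, a PARTITIONED BY (...) clause would go between the column list and the ROW FORMAT clause.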
Now all that remains is to query the table and see if the configuration is correct. To test this, let’s run this simple SQL query:
SELECT DISTINCT region, productCategory, COUNT(productCategory) AS Quantity
FROM sales
WHERE region IN ('Central', 'East', 'West')
GROUP BY region, productCategory
ORDER BY region
After running this query, we can see the output in the Results section, as shown in the image below. Output files are saved automatically to S3 for each query run, whether or not the query itself has been saved. In addition, by selecting the file icon in the Results section, we can download the query results as a CSV file. We can also download them later from the History tab.
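If you want to run the same query and fetch its results outside the console, a minimal boto3 sketch might look like this; the database, bucket, and output path are the hypothetical names used earlier, and the result file name relies on Athena writing results as <QueryExecutionId>.csv in the output location.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

query = """
SELECT DISTINCT region, productCategory, COUNT(productCategory) AS Quantity
FROM sales
WHERE region IN ('Central', 'East', 'West')
GROUP BY region, productCategory
ORDER BY region
"""

# Start the query; Athena writes the results as a CSV to the output location.
run = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "salesdb"},
    ResultConfiguration={"OutputLocation": "s3://my-sales-data-bucket/athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Download the result file that Athena saved automatically.
if state == "SUCCEEDED":
    s3.download_file("my-sales-data-bucket", f"athena-results/{query_id}.csv", "results.csv")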
Visualize Data with Amazon QuickSight
We use Amazon QuickSight for data visualization. Alternatives such as Tableau, Looker, and Mode Analytics can also produce advanced reports and visualizations, but in our case we will stick with QuickSight. Data is imported into QuickSight as a dataset, and calculations and graphs are powered by SPICE (Super-fast, Parallel, In-memory Calculation Engine), which keeps them fast. Amazon QuickSight can retrieve data from Athena, S3, RDS, Redshift, MySQL, Snowflake, and many other sources.
You must create a QuickSight account if you have never worked with it before. Before connecting it to Athena, ensure that QuickSight has permission to access both Amazon Athena and the S3 bucket; without these permissions, we will not be able to analyze and visualize the data we queried in Athena. The region also matters: the region selected for QuickSight must match the region used for Athena, otherwise QuickSight will not be able to access the data we have in Athena.
Several visualization types are available; we can choose charts such as bar charts, pie charts, donut charts, line charts, and more.
(Image source: https://aws.amazon.com)
In the image above, we used a vertical bar chart to visualize the data we previously queried in Athena. The dashboard can be refreshed as the underlying data changes, either on demand or on a schedule.
Conclusion
AWS Athena lets you run queries quickly and easily without having to set up or manage any servers or data warehouses. The service is a good choice when you need to analyze large datasets. In this blog, we used S3 to store data, connected Athena to S3 to query the data, and finally used QuickSight to visualize the results.