Most of you would know the different approaches for building a data and analytics platform. You would have already worked on systems that used traditional warehouses or Hadoop-based data lakes. Some of you might have also read about Lakehouses.
Selecting one among these several approaches to build a new data platform can become very confusing. You need to make the right choice based on your data strategy and overall data requirements. This article will help you understand each approach’s differences, benefits and limitations and help you decide what best suits your use case.
Data Warehouse – The good old way of Decision Making!
Enterprises have been building Data Warehouses for more than a couple of decades. It is one of the most popular and widely adopted approaches for storing data for insights generation and decision-making.
Datawarehouse for BI workloads
Use Cases: When should you Implement a Data Warehouse?
You should implement a Data Warehouse when you have the below data requirements.
The program’s main objective is to build a platform for insight generation, reporting & dashboards, and decision-making based on past data.
You want to analyse the historical data to understand the current trends, customer behaviour, and spending patterns.
You are dealing with mainly structured data and not-so-complex semi-structured data.
You don’t require streaming data analytics.
You are not analysing unstructured data from social media feeds or IoT devices.
Benefits
The Data Warehouse has some well-proven benefits and has already stood the test of time. Some of the key benefits are listed below.
“Best in class” performance for OLAP workloads.
ACID support – easy to handle updates, deletes, and concurrent reads & writes.
Support for ANSI SQL, one of the most popular languages among data engineers, analysts, and business users.
Mature implementation lifecycles & proven case studies. Most enterprises, large or small, have built an Enterprise Data Warehouse at some point in time.
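To make the ACID and SQL points above a little more concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a warehouse (SQLite is not a data warehouse, and the `sales` table and its values are made up for illustration). It shows a multi-statement change being rolled back as a unit, followed by a typical aggregate query.

```python
import sqlite3

def run_demo():
    # In-memory SQLite database: an assumed stand-in for a real warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 100.0), ("south", 250.0), ("north", 50.0)],
    )
    conn.commit()

    # Atomicity: the update and the delete succeed together or not at all.
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE sales SET amount = amount * 2 WHERE region = 'north'")
            conn.execute("DELETE FROM sales WHERE region = 'south'")
            raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        pass

    # The failed transaction left the data untouched, so the totals are unchanged.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return dict(rows)

print(run_demo())  # {'north': 150.0, 'south': 250.0}
```

A real warehouse adds much more on top (columnar storage, concurrency control across many users, query optimisation), but the transactional behaviour it gives you follows the same idea.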
Limitations
While the Data warehouse enjoys the above benefits, it also has a few critical limitations, as listed below.
No support for handling unstructured data.
Not well suited to streaming workloads; data typically arrives in batches, so latency can be several seconds or more.
Limited support for AI/ML use cases, as unstructured data cannot be stored.
Expensive as it often uses proprietary tools & platforms.
No easy scaling of storage & compute separately.
Quick Tips: Go for a standalone warehouse implementation when dealing with only structured data from traditional source systems like files or databases. However, in today’s world, it is difficult to imagine any business, large or small, that does not work with unstructured data. You might build a warehouse as the first step and then later plan for a data lake for unstructured or streaming use cases.
Data Lake: The Storage for all your Data!
With the rise of the Hadoop ecosystem, many enterprises started building data lakes that can store all of their data: structured, semi-structured, and unstructured. Data Lakes soon became the default choice of data storage, and many enterprises still have massive data lakes supporting their analytical data workloads.
Data Lake for AI/ML workloads
Use Cases: When to consider a Data Lake?
Data Lakes are a great choice for Big Data Implementations when dealing with massive data that needs to be stored, managed and analysed for machine learning use cases.
Go for a data lake when you have the below requirements
For storing a wide variety of data, including unstructured data from various social media feeds or IoT sources.
For supporting AI/ML workloads for building recommendation engines, forecasting or predictions.
For streaming use cases that analyse messages from source systems with low latency.
Benefits
Data Lakes provide great advantages over stand-alone warehouses. Some of the key benefits are listed below.
Cost efficient – you can leverage the cost benefits offered by cloud object storage for building a data lake.
Support for AI/ML use cases (which is difficult with a standalone Warehouse).
Helps in persisting, managing and analysing all data – including semi-structured and unstructured data.
No proprietary storage – you can use any compute engine, e.g. Spark or Presto, to query data in the data lake, along with separate data catalogue tools.
Not closely coupled with any compute service – separate scaling is easily possible.
Limitations
Like standalone warehouses, data lakes also have certain limitations. The critical ones that can impact your platform choice are listed below.
Files in a Data Lake are immutable; you cannot update or delete individual records in place without rewriting the files that contain them.
Does not support ACID – consistency can be challenging.
Analytical (OLAP) query performance is generally weaker than that of a warehouse.
Not easy to maintain metadata – Lakes can quickly become data swamps with tons of data that cannot be easily discovered.
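The immutability limitation above is easier to see with a small example. The sketch below is only an illustration under simple assumptions: a local temporary directory stands in for object storage, the data is a made-up JSON Lines file, and `update_record` is a hypothetical helper. Changing a single record means reading and rewriting every record in the file.

```python
import json
import os
import tempfile

def update_record(path, record_id, changes):
    """Apply `changes` to one record by rewriting the entire file."""
    rewritten = 0
    tmp_path = path + ".tmp"
    with open(path) as src, open(tmp_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if record["id"] == record_id:
                record.update(changes)
            dst.write(json.dumps(record) + "\n")
            rewritten += 1
    os.replace(tmp_path, path)  # swap the new file in place of the old one
    return rewritten

def demo():
    # A temporary directory plays the role of the lake's storage (assumption).
    with tempfile.TemporaryDirectory() as lake_dir:
        path = os.path.join(lake_dir, "events.jsonl")
        with open(path, "w") as f:
            for i in range(5):
                f.write(json.dumps({"id": i, "status": "new"}) + "\n")
        rewritten = update_record(path, 3, {"status": "processed"})
        with open(path) as f:
            records = [json.loads(line) for line in f]
    return rewritten, records[3]["status"]

print(demo())  # (5, 'processed')
```

All five records were rewritten to change one. With large files and concurrent readers, this is where the lack of ACID guarantees starts to hurt.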
Quick Tips: Data Lakes are great for supporting AI/ML use cases. Implement a data lake for such use cases to store large volumes of unstructured data generated from IoT devices or social media feeds.
However, it might not be a great choice to implement an analytical decision-making system that uses historical data for insights generation or decision-making or systems that need ACID features of a warehouse.
Warehouse & Lake: Answer to all your Data Needs!
Considering the above limitations, many enterprises have now started adopting a combined approach of implementing a data lake and a data warehouse. This would help to overcome their respective limitations.
A combination of a data lake + a data warehouse can solve most of your data needs and help you build all your data use cases.
Use Cases: When to consider a Data Lake + Data Warehouse
You can consider implementing this architecture pattern in most scenarios; some of these are listed below.
You want to support BI as well as AI/ML use cases.
You need a system that can accommodate all future requirements – including batch, streaming, ML-based workloads or even IoT data analytics.
You want to leverage the power of Apache Spark for unstructured data analysis and the simplicity of SQL for querying data for OLAP analysis.
Benefits
You get the benefits that each of these systems provides individually. These include the points listed below.
ACID features of data warehouse
Support to store, manage and analyse the unstructured data
Support to implement BI as well as AI/ML workloads
Support for batch as well as streaming use cases
Limitations
Though this approach seems the best among all other approaches, it also has some key limitations that can be challenging to address. Some of the key issues are highlighted below.
No single version of the truth – the same data is in the lake and Warehouse.
Additional efforts to move data from the lake to the Warehouse
Not easy to keep data in sync between lake and Warehouse
No single metadata repository – challenging to maintain an in-sync catalogue for both the systems
Challenging to implement access control across Lake and Warehouse
Not easy to orchestrate workloads between Lake and Warehouse
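To illustrate the sync problem listed above, here is a rough sketch of a reconciliation check between the two copies of a dataset. Everything in it is hypothetical: the rows are plain Python dictionaries standing in for data read from the lake and the warehouse, and `find_drift` is an illustrative helper, not part of any product.

```python
import hashlib

def row_fingerprint(row):
    """Stable hash of a row's contents, independent of key order."""
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_drift(lake_rows, warehouse_rows, key="id"):
    """Report rows that are missing on either side or differ in content."""
    lake = {r[key]: row_fingerprint(r) for r in lake_rows}
    warehouse = {r[key]: row_fingerprint(r) for r in warehouse_rows}
    shared = lake.keys() & warehouse.keys()
    return {
        "missing_in_warehouse": sorted(lake.keys() - warehouse.keys()),
        "missing_in_lake": sorted(warehouse.keys() - lake.keys()),
        "mismatched": sorted(k for k in shared if lake[k] != warehouse[k]),
    }

# Made-up sample data: row 3 was never loaded, row 2 was changed on one side.
lake_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
warehouse_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
print(find_drift(lake_rows, warehouse_rows))
```

Checks like this have to be built, scheduled and monitored for every dataset that lives in both systems, which is part of the extra effort this architecture carries.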
Quick Tips: Implementing a data lake and complementing it with a warehouse is one of the most adopted approaches today. It has evolved from previous approaches of implementing lakes and warehouses separately. You can consider this approach if you already have a Warehouse or a Lake & want to enhance the existing system to leverage the other component’s benefits.
Lakehouse: Is this the future?
The year 2022 seems like the start of an era for Lakehouses! Everyone is now talking about building a true Lakehouse – a system that helps build a single version of the truth and simultaneously gives you the best of both worlds.
Use Cases: When should you build a Lakehouse?
Lakehouse is a different architectural pattern for building your data platforms. It is a data lake but with the additional features of a warehouse. There are different open technologies used for building a lakehouse. One of the most popular among these is Delta Lake.
You can consider building a Lakehouse in the following scenarios
When you want to implement an ecosystem that can support BI and AI use cases
When you don’t want any lock-in from vendor-specific products – you can use open table formats like Delta for implementing a Lakehouse
You don’t have an existing Warehouse and are looking for a new platform that can support your data needs
Benefits
Lakehouses provide many benefits – some cannot be ignored and should be leveraged to gain cost and performance benefits. Some of the key advantages are listed below.
Best of both worlds – cost efficiencies of lake and performance of a warehouse
A single version of the truth – no duplicate data on two different storage platforms
No effort is required to move data between the Lake and the Warehouse
Built using open source technology – vendor agnostic implementation is possible
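How can a lake behave like a warehouse when its files are immutable? Open table formats such as Delta Lake keep a transaction log alongside the data files. The toy class below is only a sketch of that general idea, not Delta Lake's actual API or on-disk format: `ToyTable` and its methods are hypothetical, and in-memory dictionaries stand in for files. An update writes new files and records an add/remove commit in the log, so readers always see a consistent version.

```python
class ToyTable:
    """Toy table: immutable data files plus an append-only commit log."""

    def __init__(self):
        self.files = {}  # file name -> tuple of rows; never modified once written
        self.log = []    # commits, each {"add": [names], "remove": [names]}

    def _write_file(self, rows):
        name = f"part-{len(self.files):05d}"
        self.files[name] = tuple(rows)
        return name

    def append(self, rows):
        self.log.append({"add": [self._write_file(rows)], "remove": []})

    def update(self, predicate, changes):
        """Rewrite only the files containing matching rows, in one commit."""
        add, remove = [], []
        for name in self.live_files():
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                new_rows = [{**r, **changes} if predicate(r) else r for r in rows]
                add.append(self._write_file(new_rows))
                remove.append(name)
        self.log.append({"add": add, "remove": remove})

    def live_files(self, version=None):
        """Replay the log; `version=n` means the state after the first n commits."""
        commits = self.log if version is None else self.log[:version]
        live = []
        for commit in commits:
            live = [f for f in live if f not in commit["remove"]] + commit["add"]
        return live

    def read(self, version=None):
        return [row for name in self.live_files(version) for row in self.files[name]]

table = ToyTable()
table.append([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
table.append([{"id": 3, "status": "new"}])
table.update(lambda r: r["id"] == 2, {"status": "done"})

print({r["id"]: r["status"] for r in table.read()})           # latest version
print({r["id"]: r["status"] for r in table.read(version=2)})  # before the update
```

Because old files are never changed, earlier versions stay readable, and the single copy of the data can serve both BI queries and ML workloads.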
Limitations
Lakehouses are still very new; there can be some limitations that you might face while implementing a Lakehouse.
It is a relatively new approach that still needs wider acceptance and adoption from the industry.
There can be limitations specific to open table formats like Apache Iceberg, Apache Hudi or Delta Lake, which form the backbone of any Lakehouse.
Integration with BI tools and connectivity with source systems can be challenging.
Quick Tips: Lakehouses seem to be the future for building data platforms. They can give the best of both worlds, but at the same time, there can be limitations related to underlying open-source technologies. The best approach to building a Lakehouse is to start with a pilot phase and see if it suits your requirements.
Conclusion
In this article, we learnt about the differences, benefits and limitations of various approaches to building a data platform. Here is a quick summary of the approaches that you can consider when building your data platform.
Data Warehouse
The platform for storing structured data that is best suited for BI workloads
Excellent performance
Uses proprietary storage
Data Lake
The platform for storing structured, semi-structured and unstructured data that can support AI/ML workloads
Cost-efficient, Scalable
Does not support ACID features
Data Lake + Data Warehouse
A combined approach that can support BI, as well as AI/ML, use cases
Supports all data types
The same data is stored in Lake & Warehouse
Needs efforts to keep data in sync
Lakehouse
The latest trend, which can offer the best of both worlds with the cost efficiencies of data lakes and the performance of warehouses.
Supports BI + AI/ML, all data types
A single version of the truth
Supports vendor-agnostic implementation
You can select any of these based on your overall data strategy and current and future data requirements.
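If it helps, the guidance in this article can be condensed into a rough rule-of-thumb function. This is only a sketch of the reasoning above, not a definitive decision procedure: the function name and its flags are hypothetical, and a real decision should also weigh cost, skills and your wider data strategy.

```python
def recommend_platform(needs_bi, needs_ml, has_unstructured, needs_streaming,
                       has_existing_platform=False):
    """Map the article's selection criteria to one of the four approaches."""
    # Requirements the article associates with a data lake.
    needs_lake_features = needs_ml or has_unstructured or needs_streaming

    if needs_bi and not needs_lake_features:
        return "Data Warehouse"
    if needs_lake_features and not needs_bi:
        return "Data Lake"
    if needs_bi and needs_lake_features:
        # Extend what you already have, or start fresh with a Lakehouse pilot.
        if has_existing_platform:
            return "Data Lake + Data Warehouse"
        return "Lakehouse"
    return "Undetermined"

print(recommend_platform(needs_bi=True, needs_ml=False,
                         has_unstructured=False, needs_streaming=False))
print(recommend_platform(needs_bi=True, needs_ml=True,
                         has_unstructured=True, needs_streaming=False))
```

The first call describes a classic reporting workload and the second a new platform that must serve both BI and ML, so they map to a Data Warehouse and a Lakehouse respectively.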
That’s it for now; I hope you have enjoyed this article. Stay tuned for more!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
I'm a Cloud Data Architect helping teams to get started with their cloud data journey!
I am an AWS/Azure/Snowflake/Databricks certified data professional & work as an independent consultant on various activities like consulting/training/mentoring - all within the cloud data space!