We are all pretty much familiar with the common modern cloud data warehouse model, which essentially provides a platform comprising a data lake (based on a cloud storage account such as Azure Data Lake Storage Gen2) AND a data warehouse compute engine such as Synapse Dedicated SQL Pools, or Redshift on AWS. There are nuances around usage and services, but most implementations follow this kind of conceptual architecture. Azure Synapse Analytics is a limitless analytics service combining data integration, exploration, warehousing, and big data analytics. This unified platform brings together data engineering, machine learning, and business intelligence needs without the need to maintain separate tools and processes.
In the above architecture, the key themes are as follows –
Ingesting data into the cloud storage layer, specifically into the “raw” zone of the data lake. The data is untyped, untransformed, and has had no cleaning done. Batch data usually comes as CSV files.
A processing engine then cleans and transforms the data across the lake zones, moving from raw -> enriched -> curated (others may know this pattern as bronze/silver/gold). Enriched is where the data is cleaned, de-duplicated, and so on, while curated is where we create our summary outputs, including facts and dimensions, all in the data lake.
The curated zone is then fed into a cloud data warehouse such as Synapse Dedicated SQL Pools, which acts as the serving layer for BI tools and analysts.
This pattern is prevalent and is still probably the one I see most often in the field.
So far, so fantastic. Although this is the most common formula, it is not a one-size-fits-all approach. It has potential problems, especially around handling the data and files in the data lake.
These perceived challenges gave rise to the “Lakehouse” architectural pattern, popularized by an excellent white paper from Databricks, the founders of the Lakehouse pattern in its current form. The paper described the problems encountered with the current approach and highlighted the following main challenges:
Lack of transactional support
Hard to enforce data quality
It’s hard/complicated to combine inserts, updates, and deletes in a data lake
This can lead to data management issues in the lake itself, resulting in data swamps rather than data lakes
There are multiple storage layers – different zones and file types in the lake PLUS inside the data warehouse itself PLUS often in the BI tool as well
So what is Lakehouse?
In short (and this is a bit of an oversimplification, but it will do for this article), the Lakehouse pattern boils down to two things –
Delta Lake – This is the secret sauce of the Lakehouse pattern, and it tries to solve the problems highlighted in the Databricks document. More on that below.
The data lake and the data warehouse are no longer different things – they become one and the same.
The conceptual architecture for the Lakehouse is shown below –
As you can see, the main difference in terms of approach is that there is no longer a separate compute engine for the data warehouse; instead, we serve our output from the lake itself using the Delta Lake format.
What is Delta Lake?
Delta Lake (the secret sauce mentioned above) is an open source project that enables data warehouse functionality directly ON the data lake and is well summarized in the image below –
Delta Lake brings ACID (atomicity, consistency, isolation, and durability) transactions to the data lake, allowing you to run safe, complete transactions in the lake the same way you would in a database – Delta manages this for you. Features like Time Travel allow you to query data as it was in a previous state, by timestamp or version (similar to SQL temporal tables).
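To make Time Travel concrete, here is a minimal PySpark sketch (the lake path is a hypothetical placeholder) that reads a Delta table at its current state and then at an earlier version or timestamp:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table location in the curated zone
delta_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales"

# Current state of the table
current_df = spark.read.format("delta").load(delta_path)

# The same table as it was at version 0...
v0_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# ...or as it was at a point in time
ts_df = (spark.read.format("delta")
         .option("timestampAsOf", "2022-03-01 00:00:00")
         .load(delta_path))
```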
Delta allows tables to serve as both batch and streaming sinks/sources. Previously, you had to build complex lambda architecture patterns with different tools and approaches; with Delta, you can unify it all into one much simpler architecture. There is no separate data warehouse and data lake.
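As an illustration of that unification, the sketch below (all paths are hypothetical) writes to a Delta table in batch and then treats the very same table as a streaming source feeding a downstream Delta sink:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical lake locations
bronze_path = "abfss://enriched@mydatalake.dfs.core.windows.net/events"
silver_path = "abfss://curated@mydatalake.dfs.core.windows.net/events_clean"
checkpoint = "abfss://curated@mydatalake.dfs.core.windows.net/_checkpoints/events_clean"

# Batch write (for example a daily backfill) into the Delta table
batch_df = (spark.read.option("header", True)
            .csv("abfss://raw@mydatalake.dfs.core.windows.net/events/2022/03/01/"))
batch_df.write.format("delta").mode("append").save(bronze_path)

# The same table can be consumed as a stream...
stream_df = spark.readStream.format("delta").load(bronze_path)

# ...and continuously written to another Delta table downstream
query = (stream_df.writeStream.format("delta")
         .option("checkpointLocation", checkpoint)
         .outputMode("append")
         .start(silver_path))
```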
Schema Enforcement allows us to define typed tables with constraints and data types, ensuring that our data is clean, structured, and conforms to the types we declare.
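A minimal sketch of what that enforcement looks like in practice (paths and columns are hypothetical): a write whose schema does not match the table is rejected, and schema evolution has to be opted into explicitly.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://enriched@mydatalake.dfs.core.windows.net/customers"  # hypothetical

# Create a typed Delta table
spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], "customer_id INT, customer_name STRING"
).write.format("delta").mode("overwrite").save(path)

# An append with an incompatible schema is blocked rather than silently corrupting the table
bad_df = spark.createDataFrame(
    [(3, "Carol", "x")], "customer_id INT, customer_name STRING, unexpected STRING")
try:
    bad_df.write.format("delta").mode("append").save(path)
except Exception as err:
    print("Schema enforcement blocked the write:", err)

# If the new column is intentional, evolve the schema explicitly
bad_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```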
CRUD – Editing, adding, and deleting data is MUCH easier with Delta.
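For example, updates, deletes, and upserts can all be expressed directly against a Delta table with the DeltaTable API. The sketch below uses hypothetical paths and column names:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
path = "abfss://curated@mydatalake.dfs.core.windows.net/dim_customer"  # hypothetical

target = DeltaTable.forPath(spark, path)

# Update rows in place
target.update(condition="country = 'UK'", set={"region": "'EMEA'"})

# Delete rows
target.delete("is_deleted = true")

# Upsert (merge) a batch of changes into the table
changes_df = spark.read.format("delta").load(
    "abfss://enriched@mydatalake.dfs.core.windows.net/customer_changes")

(target.alias("t")
 .merge(changes_df.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```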
How does Synapse Analytics support Lakehouse?
Synapse Analytics provides several services that allow you to build a Lakehouse architecture using native Synapse services. It’s worth mentioning that you can, of course, also do this with Azure Databricks. Still, this article is specifically aimed at those users who want to use the full range of Synapse services to build specifically on this platform.
For those who don’t know what Synapse is – in a nutshell, it’s a unified, cloud-native analytics platform running on Azure that exposes different services and tools depending on the use case and user skills. These range from SQL pools to Spark engines to graphical ETL tools and many more.
Synapse Pipelines (essentially Data Factory under the Synapse umbrella) is a graphical ETL/ELT tool that allows you to orchestrate, move, and transform data, ingesting and loading data into the Lakehouse from a huge number of sources. Regarding the Lakehouse specifically, Synapse Pipelines let you work with the Delta Lake format using the inline dataset type, which gives you all the benefits of Delta, including upserts, time travel, compression, and more.
Synapse Spark, in terms of the Lakehouse pattern, allows you to develop code-based data engineering notebooks using the language of your choice (SQL, Scala, PySpark, C#). For those who like code-driven ETL/ELT development, Spark is an excellent and versatile platform that lets you mix languages (e.g., perform a heavy transformation in PySpark, then switch to SQL and run SQL-like transformations on the same data) and run notebooks via Synapse Pipelines.
Like Synapse Pipelines, Synapse Spark uses the Spark 3.2 runtime, which includes Delta Lake 1.0, allowing you to take advantage of everything Delta provides.
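As a rough illustration of that notebook workflow (table and path names are hypothetical), a single Synapse Spark notebook might transform data with PySpark, register the result as a Delta table, and then carry on querying it in SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

enriched_path = "abfss://enriched@mydatalake.dfs.core.windows.net/sales"  # hypothetical

# Transform in PySpark...
sales = (spark.read.format("delta").load(enriched_path)
         .withColumn("sale_month", F.date_trunc("month", F.col("sale_date"))))

# ...register the result as a Delta table in the metastore...
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
sales.write.format("delta").mode("overwrite").saveAsTable("curated.sales_monthly")

# ...and continue in SQL against the same data
spark.sql("""
    SELECT sale_month, SUM(amount) AS total_sales
    FROM curated.sales_monthly
    GROUP BY sale_month
""").show()
```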
The last major service I want to mention in the Lakehouse pattern is SQL Pools – specifically Serverless SQL Pools. Synapse already has the concept of Dedicated SQL Pools, which are provisioned, massively parallel processing (MPP) database engines designed to serve as the main serving layer.
However, we do not use a Dedicated SQL Pool in the Lakehouse. Instead, we use the other SQL Pool offering inside Synapse: Serverless SQL Pools. These are ideal for the Lakehouse because they are billed per query rather than requiring always-on provisioned compute. They work by providing a T-SQL layer over the data lake, allowing you to write queries and create external objects on top of the lake that external tools can then consume. For the Lakehouse, serverless SQL pools support the Delta format, allowing you to create external objects such as views and external tables directly on top of these Delta structures.
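The T-SQL itself can be written directly in Synapse Studio; the minimal sketch below simply runs such a query from Python over ODBC so it is self-contained. The serverless endpoint, credentials, and storage path are hypothetical placeholders.

```python
import pyodbc

# Hypothetical serverless SQL endpoint and database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=lakehouse;UID=sqluser;PWD=<password>"
)

# Query a Delta folder in the curated zone directly with OPENROWSET
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/curated/sales/',
    FORMAT = 'DELTA'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```

In practice, you would typically wrap queries like this in views or external tables inside a serverless SQL database so that BI tools see a conventional SQL schema.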
Synapse-centric Lakehouse Architecture
The above shows how we translate these services into an end-to-end architecture using the Synapse platform and Delta Lake.
Data is received in batches or streams and written to the data lake. This data lands in a raw format such as CSV from databases or JSON from event sources. We don’t use Delta until the enriched zone onwards.
Synapse Spark and/or Synapse Pipelines (it is not a binary choice) are used to transform data from raw to enriched and then to curated. Which tool you use is up to you and largely depends on the use case and user preference. Cleaned and transformed data is stored in Delta format in the enriched and curated zones, as this is where our end users will access our Lakehouse.
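A minimal sketch of the raw-to-enriched step in PySpark (all paths and column names are hypothetical): read the raw CSV drop, clean and de-duplicate it, and write it out as Delta in the enriched zone.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/2022/03/01/"   # hypothetical
enriched_path = "abfss://enriched@mydatalake.dfs.core.windows.net/sales"

enriched = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv(raw_path)
            .dropDuplicates(["order_id"])                                 # de-duplicate on the business key
            .withColumn("amount", F.col("amount").cast("decimal(18,2)"))  # enforce types
            .withColumn("ingest_date", F.current_date()))

enriched.write.format("delta").mode("append").save(enriched_path)
```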
The underlying storage is Azure Data Lake Storage Gen 2, which is an Azure cloud storage service optimized for analytics and natively integrates with all services used.
Delta tables are created in Enriched/Curated and then exposed through SQL Serverless objects such as views and tables. This allows SQL analysts and BI users to analyze data and create content using a familiar SQL endpoint that points to our Delta tables on the lake.
Data Lake Zones
Source: https://www.oreilly.com/
The landing zone is a transient layer that may or may not be present. It is often used as a temporary dump for daily extracts before they are ingested into the raw layer. In the raw layer, we often see data stored in append-only, date-based subfolders, for example /raw/<source>/2022/03/15/.
After raw, we see the enriched and curated zones, now in Delta format. This takes full advantage of Delta and, more importantly, creates a metadata layer that allows the data to be accessed/edited/used by other tools via the Delta API.
The ELT Approach in the Lakehouse
Now I’ll do a quick walkthrough that summarizes the approaches in more detail—specifically, how we ingest, transform, and serve data in Synapse using this Lakehouse pattern.
Ingestion for both streaming and batch is pretty much the same in the Lakehouse as in the modern cloud DW approach. For streaming, we receive events with a message broker such as Event Hubs and then have in-stream transformation capabilities with Stream Analytics or Synapse Spark. Both read from the Event Hubs queue and allow you to transform the stream in SQL, PySpark, etc., before landing it on the sink of your choice.
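As a rough sketch of the Synapse Spark route, the example below reads from Event Hubs via its Kafka-compatible endpoint and lands the untouched payload in the raw zone; the namespace, event hub name, connection string, and paths are all hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

connection_string = "<event hubs connection string>"            # hypothetical
bootstrap = "mynamespace.servicebus.windows.net:9093"           # Kafka-compatible endpoint

raw_stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", bootstrap)
              .option("subscribe", "telemetry")                 # event hub name
              .option("kafka.security.protocol", "SASL_SSL")
              .option("kafka.sasl.mechanism", "PLAIN")
              .option("kafka.sasl.jaas.config",
                      'org.apache.kafka.common.security.plain.PlainLoginModule required '
                      f'username="$ConnectionString" password="{connection_string}";')
              .load())

# Land the payload untransformed in the raw zone as JSON; typing and cleaning happen
# later in the raw -> enriched step
(raw_stream.selectExpr("CAST(value AS STRING) AS body", "timestamp")
 .writeStream.format("json")
 .option("checkpointLocation",
         "abfss://raw@mydatalake.dfs.core.windows.net/_checkpoints/telemetry")
 .start("abfss://raw@mydatalake.dfs.core.windows.net/telemetry/"))
```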
Conclusion
This article summarizes how to build a Lakehouse architecture using Synapse Analytics. At the time of writing (March 2022), many more features are being added to Synapse, particularly in the Lakehouse space. New additions such as the lake database, database templates, and the shared metadata model will only extend the existing Lakehouse capabilities.
Finally, this article is by no means suggesting that you need to abandon your cloud data warehouse! Far from it! The Lakehouse pattern is an alternative architecture that doubles down on the data lake as the main analytics hub while providing a layer that addresses the historical challenges of data lake-based analytics architectures.