Data Warehousing with Snowflake and Other Alternatives

Trupti Dekate Last Updated : 26 Sep, 2022

7 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Over the past few years, Snowflake has grown from a virtual unknown to a retailer with thousands of customers. Businesses have adopted Snowflake as migration from on-premise enterprise data warehouses (such as Teradata) or a more flexibly scalable and easier-to-manage alternative to an existing cloud data warehouse (such as Amazon Redshift or Google BigQuery).

Because of the significant investment that Snowflake often represents, some data teams are considering it for every use case. This inevitably leads organizations to compare data lakes to Snowflake and other cloud data warehouses. (The data lake is another data platform that has emerged in the last decade, first as an on-premises installation of Hadoop and now in the cloud, built on natural object storage.)

Snowflake: Data lake or Data Warehouse?

Data lakes offer low-cost object storage of raw data and rely on external query tools to analyze large datasets using highly available computing resources. Because they access a file system rather than a structured format, data lakes are not highly performant without optimization. But once optimized, they can be extremely cost-effective, especially at scale. They are also well equipped to process streaming data.

The Data warehouse stores structured data in a proprietary format for analytical access through a tightly coupled query layer. Computational speed is high compared to a non-optimized data lake but is also more expensive.

So, where does the Snowflake fit?

With other data warehouses, Snowflake requires a proper data format, works with structured and semi-structured (but not unstructured) data, and requires its query engine. However, it differs from a traditional data warehouse in two key aspects:

1. It is only offered in the cloud

2. It separates the storage from the elastic compute layer

These improvements initially differentiated Snowflake in the market from Teradata and Redshift. However, each has tried to match those attributes—Teradata with its Vantage cloud service and Amazon with Redshift, which separates compute and storage. As a result, these services now share these two properties with cloud data lakes, which can add to the flurry of confusion about when to use which.

Bottom Line: While Snowflake offers some separation of storage from computing, it cannot be considered a data lake due to its reliance on proprietary data formats and structured storage. A data lake is built on broad access to data and the ability to choose between different compute queries and data tools; despite all Snowflake’s advantages as a warehouse, it doesn’t offer some features.

Where to store the data for cost optimization?

So Snowflake is not a data lake. But is there any reason not to store all your information on it and rely on its ability to process data with SQL quickly? One reason you should consider it is the cost.

Of course, the cost is essential when deciding how to use different analytics platforms. Using Snowflake to run complex queries on large volumes of data at high speed can significantly increase costs.

Snowflake charges based on how long the Virtual Data Warehouse (VDW) is running, plus VDW size (number of cores) and prepaid feature set (Enterprise, Standard, or Business-Critical). A given VDW, the rates are identical to the load on the VDW. This differs from the data lake, which offers spot pricing; a cheaper instance can handle a smaller bag.
Snowflake processes data in its proprietary format, and data transformation for Snowflake acceptance can be costly. Because receiving data streams creates a continuous load, it can keep the Snowflake VDW meter running 24/7. Since Snowflake does not charge differently for a 5% or 80% load of a given virtual datastore size, these costs can be high.

• Computing snowflakes is more expensive than running the same job in a data lake.

So there is a real risk of Snowflake running into an inexorable cost scenario. When you realize that you may have too much data stored in Snowflake, it can become even more expensive to get the data out because it needs to be transformed into a different format.

A similar scenario could arise involving your data scientists, who may repeatedly analyze the same or similar data sets as they experiment with different models and test various hypotheses. This is highly computationally intensive. They must be able to connect to the ML or AI tools of their choice and not be locked into a single technology just because a proprietary data format requires it.

For these reasons, relying on Snowflake for all your data analysis needs can be inefficient and expensive. From a business perspective, it may be better to supplement Snowflake with a data lake.

Compare Snowflake and Databricks

There are a lot of people who are confused between snowflake and Databricks. How does the Snowflake approach compare to Databricks’ self-proclaimed Lakehouse tool, and how stark are the differences between the two? Check out our Databricks vs Snowflake comparison to find out.

Snowflake + optimized cloud data lake = flexible and affordable analytics

None of the above is meant to disparage Snowflake or to suggest that it shouldn’t be part of your overall solution. Snowflake is a handy data warehouse and should be considered a way to provide predictable and fast analytics performance. However, by incorporating it as part of an open, optimized data lake architecture, you can ensure that you get all (or most) of Snowflake’s benefits while keeping your cloud costs under control.

Using an optimized data lake and cloud data warehouse like Snowflake allows companies to apply different patterns to different use cases based on cost and performance requirements.

You can use cheap data lake storage and keep all your data – not just recent or structured data.
Data transformation and preparation are likely much cheaper in a data lake than in Snowflake, especially for streaming data.
You are not limited in choice of query tools and can query the data lake directly or send it to Snowflake as needed. With additional tools such as search engines, time series databases, or ML/AI tools, you retain maximum flexibility and agility to work with the rest of the data as you please.

Prepare your Data Lake for Analysis

Manually preparing data in a data lake using tools like AWS Glue or Apache Spark is usually resource-intensive. Here we will use Upsolver’s Data Lake engineering platform. Upsolver provides a low-code, self-service, optimized compute layer on top of your data lake, making it powerful to serve as a cost-effective repository for analytics services.

Upsolver includes data lake engineering best practices to make processing efficient, automating the essential but time-consuming data pipeline work that every data lake requires to function well. It includes:

Converting data into columnar formats with efficient querying like Apache Parquet instead of requiring engines to query raw data.
Continuous file compression ensures performance by avoiding the “small file problem.”
Appropriate data partitioning to speed up query response.

Upsolver uses low-cost compute options such as AWS Spot EC2 instances whenever possible to reduce query costs. This can reduce calculation costs by 10X compared to standard EC2, which is much cheaper than a data warehouse.

Upsolver handles UPSERTS tables correctly, meaning you can continuously load tables from streaming data that stay current as the data and even the schema change.

Use Upsolver to normalize, filter, aggregate, and join data to your liking through a visual SQL IDE to create transformations. Then run these transformations directly in the data lake before writing them to Snowflake (including joins, aggregations, enrichments, etc.) or querying them as an external table from Snowflake’s SQL query engine.

Combining Snowflake with Upsolver’s data-rich data lake gives you the flexibility to run processing where it makes the most sense in terms of cost and choice of analytics tools.

You can even use Upsolver to continuously transform and stream data directly to Snowflake.

How to build a real-time streaming architecture using Snowflake, Data Lake Storage, and Upsolver on AWS?

First, and most importantly, be clear about which of your data streams must go to Snowflake and which can be stored in raw format in a lake for other purposes. (Remember not to keep ancient data on Snowflake; Athena is a much better tool for querying large datasets.) Then design the architecture so that the transformed streaming data is automatically sent to Snowflake.

In this reference architecture, Snowflake is only one of several data consumers. Data is optimized (sometimes significantly) for each consumer. In some cases, prepared data is most economically queried on a data lake with a query tool such as Athena; in others, the output is to a specialized data store:

• search analysis or logging (Elasticsearch)

• OLAP (Snowflake) querying

• chart analysis (Neptune)

With an open data lake architecture like this, you keep a single version of the truth in the data lake as raw data, plus you can refine and distribute the data for specific purposes.

1. Upsolver receives data streams and stores them in raw format in a data lake.

2. Data intended for OLAP or ad hoc queries are prepared, cleaned, aggregated, and formatted for direct question by Athena and output to Snowflake for processing and processing.

3. As requests come in from other analytics services, the data is prepared and then delivered to the service.

Conclusion

Snowflake is a new kind of data warehouse. It offers advantages over traditional approaches through a consumption-based pricing model and separation of storage and computing. However, it is not the best tool for every task, especially regarding costs, such as running continuous errands on streaming data. It is also a closed system, which reduces the long-term flexibility of the analysis.

Data lakes are open, more scalable, cost-effective, and can support a broader range of business cases. Now, you can use a codeless data lake engineering platform to create a pipeline to receive, prepare quickly, and format unlimited amounts of streaming data. You can then directly query or send to Snowflake—the best of both worlds.

The Data warehouse stores structured data in a proprietary format for analytical access through a tightly coupled query layer. Computational speed is high compared to a non-optimized data lake but is also more expensive.
Snowflake offers some separation of storage from computing. It cannot be considered a data lake due to its reliance on proprietary data formats and structured storage. A data lake is built on broad access to data and the ability to choose between different compute queries and data tools;
Snowflake charges based on how long the Virtual Data Warehouse (VDW) is running, plus VDW size (number of cores) and prepaid feature set (Enterprise, Standard, or Business-Critical). A given VDW, the rates are identical to the load on the VDW.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Trupti Dekate

I am an Accountant at Global private Analytics Services working with the Data Analysis Team for handling the budget of various Growing Companies. We provide service of analytics and made the work of new tech companies easy by helping them manage their total investment and giving suggestions.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Data Warehousing with Snowflake and Other Alternatives

Introduction

Snowflake: Data lake or Data Warehouse?

So, where does the Snowflake fit?

Where to store the data for cost optimization?

Compare Snowflake and Databricks

Prepare your Data Lake for Analysis

How to build a real-time streaming architecture using Snowflake, Data Lake Storage, and Upsolver on AWS?

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect