Basics of Data Modeling and Warehousing for Data Engineers

Chetan Last Updated : 25 Jul, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

Companies struggle to manage and report on all their data. Even answering basic questions like “how many customers do we have in a given region” or “what product do our customers in their 20s buy the most” can be a challenge. The data warehouse is meant to solve these challenges. The concept of data warehousing has changed dramatically since it began in the mid-1980s. It has evolved into a separate discipline to meet the growing challenges and complexities of the business world. This has led to better technology and stronger business practices.

Initially, data warehouses were created so that companies could store a source of analytical data that they could use to answer queries. This is still an important factor, but today companies need easy access to information at scale for a diverse set of end-users. The definition of a user has expanded greatly, from specialized engineers to almost anyone who can drag and drop in Tableau.

Understanding the end-users of a data warehouse is essential if you are planning to build one. With modern tools it can be easy to load data into Snowflake or BigQuery without considering the end-user, but the goal should be to create a base layer of data that is easy for anyone to understand. At the end of the day, data is the product of a data team and needs to be as understandable, reliable, and easy to use as any other feature or product.

Data as Product

Data is a product in its own right. The saying that data is the new oil still carries weight, but a product is expected to work. We do not want crude oil. We want high-octane fuel. We want to be able to put the fuel into our car and drive without any problems. As more people work directly with this product, it needs to be usable. This means it should be:

Easy to understand
Easy to work with
Robust
Reliable
Timely

Evaluating a company's data processes can greatly improve the end-user experience with that data. Overall, managing data as a product means following best practices that help refine data from crude oil into high-octane fuel.

Best Data Model Practices

When building your data warehouse, it is important to use best practices. However, it is also important not to go overboard. Many data warehouse solutions do not even easily support (or are not configured for) some of the most common data modeling methods. But this does not mean that you can simply load data into your warehouse without standards or modeling. You do not need to be dogmatic, but be consistent. If you are going to make a mistake, make the same mistake everywhere. This means you will need to set standards that make it clear to developers what is expected of them.

Basic best practices, such as standardized naming, can make a huge difference in how well end-users understand the data.

Standardize names – To ensure that analysts can quickly identify what columns mean, common naming conventions are required. Using consistent markers for data types, such as “ts”, “date”, “is_” and so on, ensures that everyone knows what they are looking at without consulting the data documentation. This is similar to the old design principle of a signifier: the name itself indicates what the column holds.
Manage data structures – Overall, avoiding complex data structures such as arrays and dictionaries in the core layers is beneficial because it reduces the confusion that analysts may have.
Standardize IDs as much as you can – IDs allow analysts to integrate data across multiple systems. Looking back on my career, this best practice has had a profound effect. When IDs were not standardized, I was completely unable to join the data sets, no matter how skilled I was. By comparison, when I worked for companies that had processes in place to ensure that system IDs were tracked, I was able to join very different data sets with ease.
Improve processes with software teams – This is less of a best practice and more of a response to one of the biggest problems you will face: ensuring that your source data does not change too much. Of course, you can store your data as JSON or as unstructured data sets in the raw layer. But as a data engineer, the more you understand about the changes happening upstream in data platforms and organizations, the more failures you can avoid.
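The naming practice above can be enforced with a small check. The following is only a minimal sketch: the specific conventions (a `_ts` suffix for timestamps, a `_date` suffix for dates, an `is_` prefix for booleans), the `find_naming_violations` helper, and the example table are all assumptions made for illustration, not a standard taken from any particular warehouse.

```python
import re

# Hypothetical conventions, chosen only for illustration:
#   *_ts    -> timestamp columns
#   *_date  -> date columns
#   is_*    -> boolean flags
EXPECTED_PATTERNS = {
    "timestamp": re.compile(r"^[a-z0-9_]+_ts$"),
    "date": re.compile(r"^[a-z0-9_]+_date$"),
    "boolean": re.compile(r"^is_[a-z0-9_]+$"),
}


def find_naming_violations(columns):
    """Return (column_name, data_type) pairs that break the convention.

    `columns` maps a column name to a logical type name.
    Types without a registered pattern are not checked.
    """
    violations = []
    for name, data_type in columns.items():
        pattern = EXPECTED_PATTERNS.get(data_type)
        if pattern is not None and not pattern.match(name):
            violations.append((name, data_type))
    return violations


# Hypothetical table definition used as input.
orders_columns = {
    "order_id": "integer",
    "created_ts": "timestamp",
    "shipped": "boolean",      # breaks the convention: expected is_shipped
    "delivery_date": "date",
    "updated": "timestamp",    # breaks the convention: expected updated_ts
}

violations = find_naming_violations(orders_columns)
print(violations)
```

A check like this could run in code review or CI so that inconsistent names are caught before analysts ever see them. The exact rules matter less than applying the same rules everywhere.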

Higher Levels of Data Modeling Concepts

There are many different schools of thought when it comes to building a data warehouse. Galen B recently published “Read Google Data Developers: Dimensional Data Model Dead”. This has received mixed reviews, as there are still many people who strongly support traditional data modeling. No matter which camp you are in, it is important to note that there is no shortcut to a solid base data model.

Your data engineering team will need to take some time to understand how data is used, what it represents, and what it looks like. This will ensure that you create data sets that your colleagues will want to use and can use effectively. It all starts with a common set of data processing layers.

Here are the key layers in most corporate data modeling patterns:

Raw – This layer is usually stored in S3 buckets or perhaps in raw tables, and serves as the first landing place for data. Teams can then run quick data checks to ensure that all data remains healthy. Data can also be reloaded from here in case of accidental deletion.
Stage – Some form of data pre-processing is usually inevitable. Data teams rely on staging layers to make the first pass over their data. To one degree or another, there is often duplicate data, heavily nested data, and inconsistently named data that is standardized in the staging layer. Once the data is processed, there is usually another QA step before loading the data into the core data layer.
Core – This layer is where you will find the company's baseline data. This is where you can track everything that happens in the business at the most granular level. You can think of this layer as the place where all the various entities and relationships are kept. It is the foundation on which everything else is built.
Analytics – The analytics layer usually consists of wide, pre-joined tables that reduce the number of errors and incorrect applications of logic that can occur when analysts work directly on the core data layer.
Aggregates – On top of the analytics layer there is often a set of metrics, KPIs, and aggregated data sets that need to be created. These metrics and KPIs are used in reports and dashboards that go to directors, the C-suite, and operations managers who need to make decisions based on how KPIs change over time.
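The flow through these layers can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not a production design: the JSON payloads, field names, and the `to_stage`, `to_core`, `to_analytics`, and `to_aggregates` functions are all hypothetical, and a real warehouse would implement each step as tables and SQL or pipeline jobs.

```python
import json
from collections import defaultdict
from decimal import Decimal

# Raw layer: events kept exactly as received (hypothetical JSON payloads).
raw_events = [
    '{"orderId": 1, "customer": {"id": "C1", "age": 24}, "amount": "19.99"}',
    '{"orderId": 1, "customer": {"id": "C1", "age": 24}, "amount": "19.99"}',
    '{"orderId": 2, "customer": {"id": "C2", "age": 31}, "amount": "5.00"}',
    '{"orderId": 3, "customer": {"id": "C1", "age": 24}, "amount": "10.01"}',
]


def to_stage(raw_rows):
    """Stage: parse, flatten nested fields, standardize names, deduplicate."""
    seen_order_ids = set()
    staged_rows = []
    for line in raw_rows:
        record = json.loads(line)
        row = {
            "order_id": record["orderId"],
            "customer_id": record["customer"]["id"],
            "customer_age": record["customer"]["age"],
            "amount_cents": int(Decimal(record["amount"]) * 100),
        }
        if row["order_id"] not in seen_order_ids:
            seen_order_ids.add(row["order_id"])
            staged_rows.append(row)
    return staged_rows


def to_core(staged_rows):
    """Core: split flat rows into separate customer and order entities."""
    customers = {
        r["customer_id"]: {
            "customer_id": r["customer_id"],
            "customer_age": r["customer_age"],
        }
        for r in staged_rows
    }
    orders = [
        {
            "order_id": r["order_id"],
            "customer_id": r["customer_id"],
            "amount_cents": r["amount_cents"],
        }
        for r in staged_rows
    ]
    return customers, orders


def to_analytics(customers, orders):
    """Analytics: one wide, pre-joined row per order."""
    return [
        {**order, "customer_age": customers[order["customer_id"]]["customer_age"]}
        for order in orders
    ]


def to_aggregates(wide_rows):
    """Aggregates: total revenue in cents per customer."""
    totals = defaultdict(int)
    for row in wide_rows:
        totals[row["customer_id"]] += row["amount_cents"]
    return dict(totals)


staged = to_stage(raw_events)
customers, orders = to_core(staged)
wide_orders = to_analytics(customers, orders)
revenue_by_customer = to_aggregates(wide_orders)
print(revenue_by_customer)
```

Each function only reads from the layer directly below it, so a bad transformation can be fixed and re-run from the untouched raw data.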

Why Invest In Best Practices

Building a solid data storage system, whether a data warehouse or a data lakehouse, gives you a solid foundation on which to build. Whether you are building a data product or doing research, a well-defined basic data layer allows everyone in the company to build their data products with complete confidence in the accuracy of the data.

Additionally, small tasks such as standardizing IDs make it easy to join data across the data sets of all your systems. This means that end-users such as data analysts and data scientists can build analyses on top of very different data sets.
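One common way to make such joins possible is a crosswalk that maps each system's own identifier to a shared ID. The sketch below assumes two hypothetical source systems (`crm` and `billing`), a hand-written `id_crosswalk` mapping, and made-up records. It illustrates the idea only and does not describe any specific tool.

```python
# Hypothetical crosswalk: (source_system, source_id) -> shared customer_id.
id_crosswalk = {
    ("crm", "CRM-1001"): "cust_1",
    ("billing", "B-77"): "cust_1",
    ("crm", "CRM-1002"): "cust_2",
    ("billing", "B-78"): "cust_2",
}

# Hypothetical records from two systems that use different ID schemes.
crm_contacts = [
    {"crm_id": "CRM-1001", "email": "a@example.com"},
    {"crm_id": "CRM-1002", "email": "b@example.com"},
]
billing_invoices = [
    {"billing_id": "B-77", "amount_cents": 1200},
    {"billing_id": "B-78", "amount_cents": 800},
    {"billing_id": "B-77", "amount_cents": 300},
]


def with_customer_id(rows, system, id_field):
    """Attach the shared customer_id; rows with unmapped IDs get None."""
    return [
        {**row, "customer_id": id_crosswalk.get((system, row[id_field]))}
        for row in rows
    ]


def join_on_customer_id(left_rows, right_rows):
    """Inner join two lists of dicts on the shared customer_id."""
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row["customer_id"], []).append(row)
    return [
        {**left, **right}
        for left in left_rows
        if left["customer_id"] is not None
        for right in right_index.get(left["customer_id"], [])
    ]


contacts = with_customer_id(crm_contacts, "crm", "crm_id")
invoices = with_customer_id(billing_invoices, "billing", "billing_id")
joined = join_on_customer_id(contacts, invoices)

for row in joined:
    print(row["customer_id"], row["email"], row["amount_cents"])
```

Without the crosswalk, `CRM-1001` and `B-77` have nothing in common and no amount of analyst skill can join them reliably.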

And that is before considering all the benefits people get when column names follow an expected convention, such as not spending time analyzing columns before using them because they are unsure what kind of data to expect from a given column. I know, it sounds like a small thing. While all of these best practices take time, they ensure that, over time, a company can make decisions with a high level of confidence in its data.

Conclusion

Companies invest in centralized data storage systems because they provide easy access to large amounts of data without the need to pull data directly from every data source, place it in Excel, and transform it all there. That is why companies invest so much in data warehouses, data lakes, and data lakehouses. Creating any type of data storage system always requires a certain level of configuration and data modeling (in one way or another).

Data modeling is important before you actually start using the data. Data treated as a product should be easy to understand, easy to work with, robust, and reliable.
Data modeling best practices include standardizing names, managing data structures, standardizing IDs, and improving processes with software teams.
Higher-level data modeling concepts cover the key layers in most corporate data modeling patterns, including Raw, Stage, Core, and Analytics. I hope you now understand the importance of data modeling and will put it to use.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Chetan

A data analyst who loves to drive insights by visualizing data and extracting knowledge from it. Automates various tasks using Python and builds real-time dashboards using technologies like React and Node.js. Capable of creating complex SQL queries to fetch accurate data.

Joohnse Komala

standards or modelling are totally different aspects. I couldn't find any smell of data warehousing and any modeling techniques discussed here and you never mentioned how data engineers benefit from it. Please bring the context in the article to align with your topic or other way round. just advise. hope you got my point.
