In this article, I will walk you through the layers of the Data Platform Architecture. First of all, let's understand what a layer is: a layer represents a serviceable part of the platform that performs a specific job or set of tasks. The layers we are going to discuss in this article are the Data Ingestion layer, the Data Storage and Integration layer, the Data Processing layer, the Analysis and User Interface layer, and the Data Pipeline layer. If you are new to Data Engineering, then follow these top 9 skills required to be a data engineer.
This is the first layer of the data platform architecture. The Data Ingestion layer, also called the Data Collection layer, is, as the name suggests, responsible for connecting to the source systems and bringing data into the data platform on a periodic basis.
There are various tools available in the market for this layer; Google Cloud Dataflow, IBM Streams, Amazon Kinesis, and Apache Kafka are some of the popular ones, and they support ingestion in both batch and streaming modes.
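To make the two ingestion modes concrete, here is a minimal sketch in plain Python. It uses a simple list as a stand-in for a source system; the function and record names are hypothetical, not from any of the tools listed above.

```python
# Batch vs. streaming ingestion, sketched with a list-backed "source system".
# All names here are illustrative.

def batch_ingest(source_records):
    """Batch mode: pull everything from the source in one periodic run."""
    return list(source_records)

def stream_ingest(source_records):
    """Streaming mode: yield records one at a time as they arrive."""
    for record in source_records:
        yield record

source = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

batch = batch_ingest(source)                  # one bulk load
stream = list(stream_ingest(source))          # record-by-record
```

Both modes deliver the same records; the difference is whether the platform receives them as one periodic bulk load or as a continuous flow.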
Once the data is ingested, it needs to be stored and integrated into the data platform, much like the way we store food in the stomach. For storing and integrating the data, we head to the second layer of the data platform: the Data Storage layer, also called the Data Integration layer.
This is the second layer of the data platform architecture. The Data Storage layer, as the name suggests, is responsible for storing data for processing and long-term use. This layer is also responsible for making data available for processing in both streaming and batch modes, and because of that, it needs to be reliable, scalable, high-performing, and cost-efficient.

IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and PostgreSQL are some of the popular relational databases. Cloud-based relational databases have gained popularity in recent years; examples include IBM DB2 on Cloud, Google Cloud SQL, and SQL Azure. Among NoSQL (non-relational) database systems on the cloud, we have IBM Cloudant, Redis, MongoDB, Cassandra, and Neo4J. Tools for integration include IBM's Cloud Pak for Data, IBM's Cloud Pak for Integration, and Open Studio. Once the data has been ingested, stored, and integrated, it needs to be processed. With this, we move forward to the Data Processing layer.
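As a small illustration of this layer, the sketch below uses Python's built-in sqlite3 module as a stand-in for the relational databases mentioned above: it stores records and then makes them available again for downstream processing. The table and column names are purely illustrative.

```python
# Storage-layer sketch: store data, then make it available for processing.
# sqlite3 stands in for the relational databases named in the article.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the example
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events (source, value) VALUES (?, ?)",
    [("sensor-a", 1.5), ("sensor-b", 2.5)],
)
conn.commit()

# The same layer also serves data back out for the processing layer.
rows = conn.execute("SELECT source, value FROM events ORDER BY id").fetchall()
print(rows)  # [('sensor-a', 1.5), ('sensor-b', 2.5)]
```

A production storage layer adds durability, replication, and access control on top of this basic store-and-retrieve contract.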
This is the third layer of the data platform architecture. As the name suggests, this layer is responsible for processing the data: validating it, transforming it, and applying business logic to it.
Numerous tools are available in the market for performing these operations on data, such as spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, and Trifacta Wrangler. Python and R also offer several libraries and packages explicitly created for processing data. It is important to know that storage and processing are not always performed in separate layers. For example, in relational databases, storage and processing both occur in the same layer, while in Big Data systems, data is first stored in the Hadoop Distributed File System (HDFS) and then processed in a data processing engine such as Spark.
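The kind of work this layer does can be sketched in a few lines of plain Python: validate raw records, then apply a transformation and a simple piece of business logic. The field names and the "high value" rule below are illustrative assumptions, not something from a specific tool.

```python
# Processing-layer sketch: validation, transformation, business logic.
# Field names and rules are illustrative.

raw_records = [
    {"customer": "alice", "amount": "120.50"},
    {"customer": "bob", "amount": "not-a-number"},  # fails validation
    {"customer": "carol", "amount": "80.00"},
]

def validate(record):
    """Keep only records whose amount parses as a number."""
    try:
        float(record["amount"])
        return True
    except (KeyError, ValueError):
        return False

def transform(record):
    """Normalize types and casing, then apply a business rule."""
    amount = float(record["amount"])
    return {
        "customer": record["customer"].title(),  # normalize casing
        "amount": amount,
        "high_value": amount > 100,              # illustrative business rule
    }

processed = [transform(r) for r in raw_records if validate(r)]
```

The invalid record is dropped during validation, and the remaining two come out cleaned and enriched, ready for the layers above.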
This is the fourth layer of the data platform architecture. This layer is responsible for delivering the processed data to end users, including business intelligence analysts and business stakeholders, who consume the data through interactive dashboards and reports. Data scientists and data analysts also fall into this end-user category; they further process the data for specific use cases. This layer needs to support querying tools, both SQL and NoSQL, and programming languages such as Python, R, and Java. Moreover, it needs to support APIs that can be used to run reports on data for both online and offline processing.
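As a rough sketch of what sits behind a dashboard or report in this layer, the function below aggregates processed records into the summary a report would render. The record shape and the region/amount fields are hypothetical.

```python
# User-interface-layer sketch: turn processed records into a report summary.
# The record shape is illustrative.

processed = [
    {"region": "east", "amount": 120.5},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 40.0},
]

def sales_report(records):
    """Aggregate processed records into totals a dashboard would display."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

report = sales_report(processed)
print(report)  # {'east': 160.5, 'west': 80.0}
```

In a real platform, a function like this would sit behind an API endpoint or a BI tool's query interface rather than being called directly.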
This is the last layer of this architecture. This layer is responsible for implementing and maintaining a continuous flow of data through the data pipeline, and it typically relies on extract, transform, and load (ETL) tools to do so. A number of data pipeline solutions are available, the most popular among them being Apache Airflow and Google Cloud Dataflow.
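The core idea of this layer can be sketched as a pure-Python stand-in for an orchestrator: each stage is a task, and the pipeline runs them in order, passing data along. Tools like Apache Airflow express the same idea as a scheduled graph of tasks; all names below are illustrative.

```python
# Pipeline-layer sketch: a continuous extract -> transform -> load flow.
# A real orchestrator (e.g., Airflow) schedules and monitors these tasks.

def extract():
    """Pull raw records from a (hypothetical) source system."""
    return [{"id": 1, "raw": " Hello "}, {"id": 2, "raw": "World"}]

def transform(records):
    """Clean the raw field: strip whitespace, lowercase."""
    return [{"id": r["id"], "clean": r["raw"].strip().lower()} for r in records]

def load(records, target):
    """Write the cleaned records into the target store."""
    target.extend(records)
    return target

def run_pipeline(target):
    """One pipeline run, chaining the three tasks in order."""
    return load(transform(extract()), target)

warehouse = []
run_pipeline(warehouse)
```

An orchestrator adds what this sketch lacks: scheduling runs periodically, retrying failed tasks, and tracking which data has already flowed through.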
In this article, you learned about the layers of a data platform architecture. This is a simplified rendition of a complex architecture that supports a broad spectrum of tasks.