In this article, I will walk you through the layers of the Data Platform Architecture. First of all, let's understand what a layer is: a layer represents a serviceable part of the platform that performs a specific job or set of tasks. The layers we are going to discuss in this article are the Data Ingestion layer, the Data Storage and Integration layer, the Data Processing layer, the Analysis and User Interface layer, and the Data Pipeline layer. If you are new to Data Engineering, then follow these top 9 skills required to be a data engineer.
This is the first layer of the data platform architecture. The Data Ingestion layer, also called the Data Collection layer, is, as the name suggests, responsible for connecting to the source systems and bringing data into the data platform on a periodic basis.
There are various tools available in the market for this layer; Google Cloud Dataflow, IBM Streams, Amazon Kinesis, and Apache Kafka are some of the popular ones, and they support ingestion in both batch and streaming modes.
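To make the two ingestion modes concrete, here is a minimal sketch in plain Python. It uses a simple list as a stand-in for a source system; the function and record names are hypothetical, not from any of the tools listed above.

```python
# Batch vs. streaming ingestion, sketched with a list-backed "source system".
# All names here are illustrative.

def batch_ingest(source_records):
    """Batch mode: pull everything from the source in one periodic run."""
    return list(source_records)

def stream_ingest(source_records):
    """Streaming mode: yield records one at a time as they arrive."""
    for record in source_records:
        yield record

source = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

batch = batch_ingest(source)                  # one bulk load
stream = list(stream_ingest(source))          # record-by-record
```

Both modes deliver the same records; the difference is whether the platform receives them as one periodic bulk load or as a continuous flow.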
Once the data is ingested, it needs to be stored and integrated into the data platform, much like the way we store food in the stomach. For storing and integrating the data, we head to the second layer of the data platform: the Data Storage layer, also called the Data Integration layer.
This is the second layer of the data platform architecture. The Data Storage layer, as the name suggests, is responsible for storing data for processing and long-term use. This layer is also responsible for making data available for processing in both streaming and batch modes, and because of that, it needs to be reliable, scalable, high-performing, and cost-efficient.

IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and PostgreSQL are some of the popular relational databases. Cloud-based relational databases have gained popularity in recent years; examples include IBM DB2 on Cloud, Google Cloud SQL, and SQL Azure. Among NoSQL (non-relational) database systems on the cloud, we have IBM Cloudant, Redis, MongoDB, Cassandra, and Neo4J. Tools for integration include IBM's Cloud Pak for Data, IBM's Cloud Pak for Integration, and Open Studio. Once the data has been ingested, stored, and integrated, it needs to be processed. With this, we move forward to the Data Processing layer.
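As a small illustration of this layer, the sketch below uses Python's built-in sqlite3 module as a stand-in for the relational databases mentioned above: it stores records and then makes them available again for downstream processing. The table and column names are purely illustrative.

```python
# Storage-layer sketch: store data, then make it available for processing.
# sqlite3 stands in for the relational databases named in the article.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the example
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events (source, value) VALUES (?, ?)",
    [("sensor-a", 1.5), ("sensor-b", 2.5)],
)
conn.commit()

# The same layer also serves data back out for the processing layer.
rows = conn.execute("SELECT source, value FROM events ORDER BY id").fetchall()
print(rows)  # [('sensor-a', 1.5), ('sensor-b', 2.5)]
```

A production storage layer adds durability, replication, and access control on top of this basic store-and-retrieve contract.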
This is the third layer of the data platform architecture. As the name suggests, this layer is responsible for processing the data: validating it, transforming it, and applying business logic to it.
Numerous tools are available in the market for performing these operations on data, such as spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, and Trifacta Wrangler. Python and R also offer several libraries and packages explicitly created for processing data. It is important to know that storage and processing are not always performed in separate layers. For example, in relational databases, storage and processing both occur in the same layer, while in Big Data systems, data is first stored in the Hadoop Distributed File System (HDFS) and then processed in a data processing engine such as Spark.
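The kind of work this layer does can be sketched in a few lines of plain Python: validate raw records, then apply a transformation and a simple piece of business logic. The field names and the "high value" rule below are illustrative assumptions, not something from a specific tool.

```python
# Processing-layer sketch: validation, transformation, business logic.
# Field names and rules are illustrative.

raw_records = [
    {"customer": "alice", "amount": "120.50"},
    {"customer": "bob", "amount": "not-a-number"},  # fails validation
    {"customer": "carol", "amount": "80.00"},
]

def validate(record):
    """Keep only records whose amount parses as a number."""
    try:
        float(record["amount"])
        return True
    except (KeyError, ValueError):
        return False

def transform(record):
    """Normalize types and casing, then apply a business rule."""
    amount = float(record["amount"])
    return {
        "customer": record["customer"].title(),  # normalize casing
        "amount": amount,
        "high_value": amount > 100,              # illustrative business rule
    }

processed = [transform(r) for r in raw_records if validate(r)]
```

The invalid record is dropped during validation, and the remaining two come out cleaned and enriched, ready for the layers above.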
This is the fourth layer of the data platform architecture. This layer is responsible for delivering the processed data to end users, including business intelligence analysts and business stakeholders, who consume the data through interactive dashboards and reports. Data scientists and data analysts also fall into this end-user category; they further process the data for specific use cases. This layer needs to support querying tools, both SQL and NoSQL, and programming languages such as Python, R, and Java. Moreover, it needs to support APIs that can be used to run reports on data for both online and offline processing.
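As a rough sketch of what sits behind a dashboard or report in this layer, the function below aggregates processed records into the summary a report would render. The record shape and the region/amount fields are hypothetical.

```python
# User-interface-layer sketch: turn processed records into a report summary.
# The record shape is illustrative.

processed = [
    {"region": "east", "amount": 120.5},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 40.0},
]

def sales_report(records):
    """Aggregate processed records into totals a dashboard would display."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

report = sales_report(processed)
print(report)  # {'east': 160.5, 'west': 80.0}
```

In a real platform, a function like this would sit behind an API endpoint or a BI tool's query interface rather than being called directly.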
This is the last layer of this architecture. This layer is responsible for implementing and maintaining a continuous flow of data through the data pipeline, and it typically relies on extract, transform, and load (ETL) tools to do so. A number of data pipeline solutions are available, the most popular among them being Apache Airflow and Google Cloud Dataflow.
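The core idea of this layer can be sketched as a pure-Python stand-in for an orchestrator: each stage is a task, and the pipeline runs them in order, passing data along. Tools like Apache Airflow express the same idea as a scheduled graph of tasks; all names below are illustrative.

```python
# Pipeline-layer sketch: a continuous extract -> transform -> load flow.
# A real orchestrator (e.g., Airflow) schedules and monitors these tasks.

def extract():
    """Pull raw records from a (hypothetical) source system."""
    return [{"id": 1, "raw": " Hello "}, {"id": 2, "raw": "World"}]

def transform(records):
    """Clean the raw field: strip whitespace, lowercase."""
    return [{"id": r["id"], "clean": r["raw"].strip().lower()} for r in records]

def load(records, target):
    """Write the cleaned records into the target store."""
    target.extend(records)
    return target

def run_pipeline(target):
    """One pipeline run, chaining the three tasks in order."""
    return load(transform(extract()), target)

warehouse = []
run_pipeline(warehouse)
```

An orchestrator adds what this sketch lacks: scheduling runs periodically, retrying failed tasks, and tracking which data has already flowed through.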
In this article, you learned about the layers of a data platform architecture. This is a simplified rendition of a complex architecture that supports a broad spectrum of tasks.