In today’s data-driven world, organizations across industries deal with massive volumes of data, complex pipelines, and the need for efficient data processing. Traditional data engineering solutions, such as Apache Airflow, have long played an important role in orchestrating and managing data operations to tackle these challenges. However, with the rapid evolution of technology, a new contender, Mage, has emerged to reshape the landscape of data engineering.
Wouldn’t it be nice if building pipelines were as easy as falling off a log? Then you should definitely try Mage!
In this article, I will talk about the features and functionalities of Mage, highlighting what I have learned so far and the first pipeline I’ve built using it.
Mage is a modern, AI-assisted data orchestration tool that aims to streamline and optimize data engineering processes. It is an effortless yet effective open-source data pipeline tool for data transformation and integration, and it can be a compelling alternative to well-established tools like Airflow. By combining automation with intelligence, Mage transforms the way data is handled and processed, simplifying the data engineering workflow through a user-friendly interface.
Mage can be installed using Docker, pip, or conda commands, or hosted on a cloud service as a virtual machine.
#Command line for installing Mage using Docker
>docker run -it -p 6789:6789 -v %cd%:/home/src mageai/mageai /app/run_app.sh mage start [project_name]
#Command line for running Mage with Docker on a different port
>docker run -it -p 6790:6789 -v %cd%:/home/src mageai/mageai /app/run_app.sh mage start [project_name]
#installing using pip command
>pip install mage-ai
>mage start [project_name]
#installing using conda
>conda install -c conda-forge mage-ai
There are also additional packages for installing Mage with Spark, Postgres, and more. In this example, I used Google Cloud Compute Engine to access Mage (as a VM) via SSH. I ran the following commands after installing the necessary Python packages.
#Command for installing Mage
~$ sudo pip3 install mage-ai
#Command for starting the project
~$ mage start nyc_trides_project
Checking port 6789...
Mage is running at http://localhost:6789 and serving project /home/srinikitha_sri789/nyc_trides_proj
INFO:mage_ai.server.scheduler_manager:Scheduler status: running.
Mage provides several blocks with built-in template code and test cases, which can be customized to fit your project requirements.
I used the Data Loader, Data Transformer, and Data Exporter blocks (ETL) to load the data from an API, transform it, and export it to Google BigQuery for further analysis.
Let’s learn how each block works.
The “Data Loader” block serves as a bridge between the data source and the subsequent stages of data processing within the pipeline. It ingests data from the source and converts it into a format suitable for further processing.
The screenshot below shows raw data being loaded from the API into Mage using the data loader. After the data loader code executes and its test cases pass, the output is presented as a tree structure in the terminal.
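To make this concrete, here is a minimal sketch of a data loader block modeled on Mage’s Python API template. The URL is a placeholder rather than the actual trip-data endpoint I used, but the @data_loader/@test pattern is what Mage scaffolds for you.

import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_api(*args, **kwargs):
    # Placeholder URL; in the real pipeline this points to the raw trip-data CSV.
    url = 'https://example.com/nyc_trip_data.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_output(output, *args) -> None:
    # Built-in style test case: the block must produce an output.
    assert output is not None, 'The output is undefined'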
The “Data Transformer” block performs manipulations on the incoming data to derive meaningful insights and prepares it for downstream processes. It offers a generic code option as well as standalone files of modular code that are reusable and testable with data validations, using templates in Python (for data exploration, rescaling, and necessary column actions), SQL, and R.
After the data is loaded, the transformation code performs all the necessary manipulations (in this example, converting a flat file into fact and dimension tables) and passes the transformed data to the data exporter. The tree diagram after executing the data transformation block is shown below.
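As a rough illustration, a transformer block follows the sketch below. The column names are assumptions based on the NYC trip dataset, and the logic is a simplified stand-in for the actual fact/dimension modeling, showing only one small dimension table.

import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def transform(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Hypothetical manipulation: build a small datetime dimension table
    # from a pickup-timestamp column (column name assumed from the NYC trip data).
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    datetime_dim = df[['tpep_pickup_datetime']].drop_duplicates().reset_index(drop=True)
    datetime_dim['pickup_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
    datetime_dim['pickup_day'] = datetime_dim['tpep_pickup_datetime'].dt.day
    datetime_dim['datetime_id'] = datetime_dim.index
    return datetime_dim


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'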
The “Data Exporter” block exports and delivers processed data to various destinations or systems for further consumption, analysis, or storage. It ensures seamless data transfer and integration with external systems. We can export the data to any storage using the default templates provided for Python (API, Azure Blob Storage, Google BigQuery, GCS, MySQL, S3, Redshift, Snowflake, Delta Lake, etc.), SQL, and R.
Following the data transformation, we export the transformed/processed data to Google BigQuery using Data Exporter for advanced analytics. Once the data exporter block is executed, the tree diagram below illustrates the subsequent steps.
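For reference, a data exporter block based on Mage’s Google BigQuery template looks roughly like the sketch below. The table ID and config profile are placeholders you would replace with your own project values; credentials are read from the project’s io_config.yaml.

from os import path

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df: DataFrame, **kwargs) -> None:
    # Placeholder destination table in the form project.dataset.table.
    table_id = 'your-project.your_dataset.your_table_name'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # replace the table if it already exists
    )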
The “Preview” phase enables data engineers to inspect processed or intermediate data at any given point in the pipeline. It offers a chance to verify the accuracy of the data transformations, assess the quality of the data, and learn more about the data.
During this phase, each time we run the code, we receive feedback in the form of charts, tables, and graphs, which lets us gather valuable insights and information. We can immediately see the results of our code’s output in an interactive notebook UI. In the pipeline, every block of code generates data that we can version, partition, and catalog for future use.
In the data pipeline, the “Launch” phase represents the final step where we deploy the processed data into production or downstream systems for further analysis. This phase ensures that the data is directed to the appropriate destination and made accessible for the intended use cases.
You can deploy Mage to AWS, GCP, or Azure with only two commands using maintained Terraform templates. You can transform very large datasets directly in your data warehouse or through native integration with Spark, and operationalize your pipelines with built-in monitoring, alerting, and observability.
The screenshots below show the total pipeline runs and their statuses (successful or failed), along with the logs of each block and their log levels.
Furthermore, Mage prioritizes data governance and security, providing a secure environment for data engineering operations through built-in mechanisms such as end-to-end encryption, access controls, and auditing capabilities. Mage’s architecture follows strict data protection rules and best practices, protecting data integrity and confidentiality. Real-world use cases and success stories also highlight Mage’s potential across a variety of industries, including finance, e-commerce, and healthcare.
| MAGE | OTHER TOOLS |
| --- | --- |
| Mage is an engine for running data pipelines that move and transform data. That data can then be stored anywhere (e.g. S3) and used to train models in SageMaker. | SageMaker: SageMaker is a fully managed ML service used to train machine learning models. |
| Mage is an open-source data pipeline tool for integrating and transforming data (ETL). | Fivetran: Fivetran is a closed-source SaaS (software-as-a-service) company providing a managed ETL service. |
| Mage is an open-source data pipeline tool for integrating and transforming data. Mage’s focus is to provide an easy developer experience. | Airbyte: Airbyte is a leading open-source ELT platform that replicates data from APIs, applications, and databases to data lakes, data warehouses, and other destinations. |
In conclusion, data engineers and analytics experts can efficiently load, transform, export, preview, and deploy data by utilizing the features of each phase in Mage and its efficient framework for managing and processing data. This enables data-driven decision-making, the extraction of valuable insights, and readiness for production and downstream systems. Mage is widely recognized for its cutting-edge capabilities, scalability, and strong focus on data governance, making it a game-changer for data engineering.
Q1. What features set Mage apart from other data pipeline tools?
A. Features that set Mage apart (some of the other tools might eventually have these features):
1. Easy UI/IDE for building and managing data pipelines. When you build your data pipeline, it runs exactly the same in development as it does in production. Deploying the tool and managing the infrastructure in production is straightforward, unlike with Airflow.
2. Extensible: We designed and built the tool with developers in mind, making sure it’s really easy to add new functionality to the source code or through plug-ins.
3. Modular: Every block/cell you write is a standalone file that is interoperable, meaning it can be used in other pipelines or in other code bases.
Q2. Which languages does Mage support?
A. Mage currently supports Python, SQL, R, and PySpark, with Spark SQL planned for the future.
Q3. How is Mage different from Databricks?
A. Databricks provides infrastructure to run Spark, along with notebooks that can run your code on Spark. Mage can execute your code in a Spark cluster managed by AWS, GCP, or even Databricks.
Q4. Can Mage integrate with existing data infrastructure and tools?
A. Yes, Mage is designed to integrate seamlessly with existing data infrastructure and tools. It supports various data storage platforms, databases, and APIs, allowing for smooth integration with your preferred systems.
Q5. Can Mage handle large volumes of data?
A. Yes. Mage can accommodate fluctuating data volumes and processing needs thanks to its scalability and performance optimization capabilities, making it suited to organizations of various sizes and levels of data processing complexity.