Apache Airflow is a must-have tool for Data Engineers: it makes creating and monitoring all your workflows much easier. When you have multiple workflows, there is a good chance that several of them use the same databases and the same file paths. Variables are one of the most efficient ways to define such shared information across workflows. In this article, you will learn what Apache Airflow is, whether it is free, and how its variables work.
We will cover the concept of variables and walk through an example of the PythonOperator in Apache Airflow.
This article is a continuation of Data Engineering 101 – Getting Started with Apache Airflow, where we covered the features and components of Airflow, the installation steps, and how to create a basic DAG. So if you are a complete beginner with Apache Airflow, I recommend you go through that article first.
Apache Airflow is a workflow engine that makes it easy to schedule and run your complex data pipelines. It makes sure that each task of your data pipeline is executed in the correct order and that each task gets the required resources.
It also provides an intuitive user interface to monitor your pipelines and fix any issues that arise.
We have already discussed the installation steps in the previous article of this series.
To start the Airflow webserver, open a terminal and run the following command. The default port is 8080; if that port is already in use by something else, you can change it.
airflow webserver -p 8080
Now, start the Airflow scheduler with the following command in a different terminal. It monitors all your workflows and triggers them according to the schedules you have assigned.
airflow scheduler
Now, make sure that you have a folder named dags in the Airflow home directory, where you will define your DAGs. Then open a web browser, go to http://localhost:8080/admin/, and you will see something like this:
An operator describes a single task in a workflow, and Airflow provides different operators for many different kinds of tasks, for example the BashOperator, PythonOperator, EmailOperator, MySqlOperator, etc. In the last article, we learned how to use the BashOperator to get live cricket scores, and in this one we will see how to use the PythonOperator.
Let’s have a look at the following example:
Let’s start by importing the libraries that we need. We will use the PythonOperator this time.
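A minimal set of imports might look like the sketch below; the module paths assume Airflow 1.10.x (in Airflow 2.x the PythonOperator is imported from airflow.operators.python instead):

from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in Airflow 2.x
from airflow.utils.dates import days_ago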
For each DAG, we need to pass one arguments dictionary. Here is a description of some of the arguments that you can pass:
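The dictionary below is a sketch with illustrative values (the specific settings are assumptions, not taken from the original configuration); the comments describe what each argument controls:

# Illustrative default_args dictionary applied to every task of the DAG
default_args = {
    'owner': 'airflow',                   # owner of the DAG, shown in the UI
    'depends_on_past': False,             # a run does not wait on the previous run of the same task
    'start_date': days_ago(1),            # date from which the DAG is scheduled
    'email_on_failure': False,            # whether to send an email when a task fails
    'retries': 1,                         # how many times to retry a failed task
    'retry_delay': timedelta(minutes=5),  # how long to wait between retries
}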
Now, we will define the Python function that will print a string passed to it as an argument; this function will later be used by the PythonOperator.
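A sketch of such a function (the name print_function and its message parameter are assumed for illustration):

# Function the PythonOperator will call; `message` is supplied through op_kwargs
def print_function(message):
    print(message)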
Now, we will create a DAG object and pass the dag_id, which is the name of the DAG; make sure you have not created any other DAG with this name before. Pass the arguments that we defined earlier, add a description, and set the schedule_interval, which runs the DAG after the specified interval of time.
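A sketch of the DAG object, assuming a dag_id of python_operator_demo and a daily schedule (both are illustrative choices):

# DAG object; the task defined below is attached to it
dag = DAG(
    dag_id='python_operator_demo',        # must be unique across your DAGs
    default_args=default_args,            # the arguments dictionary defined above
    description='A simple DAG using the PythonOperator',
    schedule_interval=timedelta(days=1),  # run once a day
)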
We just have one task for our workflow:
print: In this task, we will print “Apache Airflow is a must-have tool for Data Engineers” on the terminal using the Python function. We pass a task_id to the PythonOperator object; you will see this name on the nodes of the Graph View of your DAG. Pass the name of the Python function you want to run to the argument “python_callable”, the arguments your function uses to the parameter “op_kwargs” as a dictionary, and finally the DAG object to which you want to link this task.
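Putting these pieces together, the task might look like the sketch below (the keyword name message matches the assumed function signature above):

# Single task of the workflow: runs print_function with the message below
print_task = PythonOperator(
    task_id='print',                      # name shown on the Graph View node
    python_callable=print_function,       # the Python function defined earlier
    op_kwargs={'message': 'Apache Airflow is a must-have tool for Data Engineers'},
    dag=dag,                              # the DAG this task is linked to
)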
Now, when you refresh your Airflow dashboard, you will see your new DAG in the list.
Click on the DAG and open the Graph View, and you will see something like this. Each step in the workflow appears in a separate box. In this workflow, we only have one step, print. Run the workflow and wait until its border turns dark green, which indicates that it completed successfully. Click on the node “print” to get more details about this step, then click on Logs, and you will see output like this.
We know that Airflow can be used to create and manage complex workflows, and that we can run multiple workflows at the same time. There is a good chance that most of your workflows use the same database or the same file path. Now, suppose you make a change, like moving the directory where you save files or changing the configuration of a database. In that case, you don’t want to go and update each of the DAGs separately.
Airflow provides a solution for this: you can create variables in which you store and retrieve data at runtime across multiple DAGs. So if any major change occurs, you can just edit your variable and your workflows are good to go.
Open the Airflow dashboard, click on Admin in the top menu, and then click on Variables.
Now, click on Create to create a new variable, and a window like this will open. Add the key and value and submit it. Here, I am creating a variable with the key data_path and, as its value, the path of a random text file.
Now, we will create a DAG that finds the word count of the text data present in this file. When you want to use variables, you need to import them. Let’s see how to do this:
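For this second DAG file, the imports might look like the following sketch (again assuming Airflow 1.10.x module paths for the operator):

from datetime import timedelta

from airflow import DAG
from airflow.models import Variable                           # access to variables stored by Airflow
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in Airflow 2.x
from airflow.utils.dates import days_ago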
Then, we will define the function that will use the path from the variable, read the file, and calculate the word count.
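A sketch of such a function (the name count_words is an assumption):

# Read the file whose path is stored in the data_path variable and count its words
def count_words():
    data_path = Variable.get('data_path')  # fetch the variable value at runtime
    with open(data_path) as f:
        text = f.read()
    print(f'Word count: {len(text.split())}')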
The rest of the steps are the same as before: define the DAG and the task, and your workflow is ready to run.
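For completeness, a sketch of the remaining pieces, with an assumed dag_id of word_count_dag and task_id of word_count:

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    dag_id='word_count_dag',
    default_args=default_args,
    description='Count the words in the file pointed to by the data_path variable',
    schedule_interval=timedelta(days=1),
)

word_count_task = PythonOperator(
    task_id='word_count',
    python_callable=count_words,
    dag=dag,
)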
You can see the results in the log. You can now use this variable in any other DAG, and you can edit it whenever you want and all your DAGs will pick up the change.
In this article, we understood how to use the PythonOperator in Apache Airflow, the concept of variables, and how to create them. In the upcoming article, we will create a machine learning project and automate its workflow using Apache Airflow.
What is the difference between Apache and Airflow? Apache is like a big umbrella for many open-source projects; Airflow is a specific tool under this umbrella that helps manage and automate workflows.
Is Apache Airflow good for ETL? Yes, Apache Airflow is good at helping with ETL, which stands for Extract, Transform, and Load. It makes moving and transforming data between systems easier.
Is Airflow itself an ETL tool? No, Airflow isn’t an ETL tool itself, but a conductor for your ETL orchestra. It schedules, manages, and monitors the different stages of your data pipeline, including ETL tasks.
Is Apache Airflow still used? Yes, Apache Airflow is still widely used for scheduling, monitoring, and managing complex data pipelines. It remains popular due to its flexibility, scalability, and strong community support.
Is Apache Airflow free? Yes, Apache Airflow is free. It is an open-source project under the Apache License 2.0, which means you can use, modify, and distribute it without any cost.
I recommend you go through the following data engineering resources to enhance your knowledge:
If you have any questions related to this article, do let me know in the comments section below.