Airflow Install

12/1/2020

Now let’s install and activate a python virtual environment. This will allow us to install and update packages without affecting the core machine’s python libraries. sudo pip install virtualenv source venv/bin/activate. Install apache airflow server with s3, all databases, and jdbc support. (venv)pip install 'apache-airflows3, alldbs,jdbc'. Step to install airflow on windows without Docker Step 1: Control Panel Programs and Features Turn Windows features on or off Enable: Windows Subsystem.

Install Airflow Windows
Airflow Install Windows
Airflow Install Packages
Airflow Installation In Ec2

Install Apache Airflow can be installed on computer or you can use it by Cloud/Pass. Depending on your goals choose the best method of installation. Remember that Apache Airflow is a platform created by the Community and you can join us!
How to install and setup Apache Airflow on Ubuntu 16 or 18. Manager Mon, - 09:53. Create a new Ubuntu virtual machine login # sudo su # apt-get update.

In one of our previous blog posts, we described the process you should take when Installing and Configuring Apache Airflow. In this post, we will describe how to setup an Apache Airflow Cluster to run across multiple nodes. This will provide you with more computing power and higher availability for your Apache Airflow instance.

Airflow Daemons

A running instance of Airflow has a number of Daemons that work together to provide the full functionality of Airflow. The daemons include the Web Server, Scheduler, Worker, Kerberos Ticket Renewer, Flower and others. Bellow are the primary ones you will need to have running for a production quality Apache Airflow Cluster.

Web Server

A daemon which accepts HTTP requests and allows you to interact with Airflow via a Python Flask Web Application. It provides the ability to pause, unpause DAGs, manually trigger DAGs, view running DAGs, restart failed DAGs and much more.

The Web Server Daemon starts up gunicorn workers to handle requests in parallel. You can scale up the number of gunicorn workers on a single machine to handle more load by updating the ‘workers’ configuration in the {AIRFLOW_HOME}/airflow.cfg file.

Example

Startup Command:

Scheduler

A daemon which periodically polls to determine if any registered DAG and/or Task Instances needs to triggered based off its schedule.

Startup Command:

Executors/Workers

A daemon that handles starting up and managing 1 to many CeleryD processes to execute the desired tasks of a particular DAG.

This daemon only needs to be running when you set the ‘executor ‘ config in the {AIRFLOW_HOME}/airflow.cfg file to ‘CeleryExecutor’. It is recommended to do so for Production.

Example:

Install Airflow Windows

Startup Command:

How do the Daemons work together?

One thing to note about the Airflow Daemons is that they don’t register with each other or even need to know about each other. Each of them handle their own assigned task and when all of them are running, everything works as you would expect.

The Scheduler periodically polls to see if any DAGs that are registered in the MetaStore need to be executed. If a particular DAG needs to be triggered (based off the DAGs Schedule), then the Scheduler Daemon creates a DagRun instance in the MetaStore and starts to trigger the individual tasks in the DAG. The scheduler will do this by pushing messages into the Queueing Service. Each message contains information about the Task it is executing including the DAG Id, Task Id and what function needs to be performed. In the case where the Task is a BashOperator with some bash code, the message will contain this bash code.
A user might also interact with the Web Server and manually trigger DAGs to be ran. When a user does this, a DagRun will be created and the scheduler will start to trigger individual Tasks in the DAG in the same way that was mentioned in #1.
The celeryd processes controlled by the Worker daemon, will pull from the Queueing Service on regular intervals to see if there are any tasks that need to be executed. When one of the celeryd processes pulls a Task message, it updates the Task instance in the MetaStore to a Running state and tries to execute the code provided. If it succeeds then it updates the state as succeeded but if the code fails while being executed then it updates the Task as failed.

Single Node Airflow Setup

A simple instance of Apache Airflow involves putting all the services on a single node like the bellow diagram depicts.

Multi-Node (Cluster) Airflow Setup

A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.

Benefits

Higher Availability

If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.

Distributed Processing

If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilizaiton of data across the cluster and provide faster execution of the tasks.

Scaling Workers

Horizontally

You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.

Vertically

You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.

Example:

You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.

Scaling Master Nodes

You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon incase there are too many HTTP requests coming for one machine to handle or if you want to provide Higher Availability for that service.

One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.

If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.

Apache Airflow Cluster Setup Steps

Pre-Requisites

The following nodes are available with the given host names:
- master1
  - Will have the role(s): Web Server, Scheduler
- master2
  - Will have the role(s): Web Server
- worker1
  - Will have the role(s): Worker
- worker2
  - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
- You can install RabbitMQ by following these instructions: Installing RabbitMQ
  - If you’re using RabbitMQ, it is recommended that it is also setup to be a cluster for High Availability. Setup a Load Balancer to proxy requests to the RabbitMQ instances.

Steps

Install Apache Airflow on ALL machines that will have a role in the Airflow
- You can follow the instructions on our previous blog post: Installing and Configuring Apache Airflow.
Apply Airflow Configuration changes to all ALL machines. Apply changes to the {AIRFLOW_HOME}/airflow.cfg file.
1. Change the Executor to CeleryExecutor
2. Point SQL Alchemy to the MetaStore
3. Set the Broker URL
  1. If you’re using RabbitMQ:
  2. If you’re using AWS SQS:
4. Point Celery to the MetaStore
Deploy your DAGs/Workflows on master1 and master2 (and any future master nodes you might add)
On master1, initialize the Airflow Database (if not already done after updating the sql_alchemy_conn configuration)
On master1, startup the required role(s)
- Startup Web Server
- Startup Scheduler
On master2, startup the required role(s)
- Startup Web Server
On worker1 and worker2, startup the required role(s)
- Startup Worker
Create a Load Balancer to balance requests going to the Web Servers
- If you’re in AWS you can do this with the EC2 Load Balancer
  - Sample Configurations:
    - Port Forwarding
      - Port 8080 (HTTP) → Port 8080 (HTTP)
    - Health Check
      - Protocol: HTTP
      - Ping Port: 8080
      - Ping Path: /
      - Success Code: 200,302
- If you’re not on AWS you can use something like haproxy to proxy/balance requests to the Web Servers
  - Sample Configurations:
You’re done!

Additional Documentation

Documentation: https://airflow.incubator.apache.org/

Install Documentation: https://airflow.incubator.apache.org/installation.html

GitHub Repo: https://github.com/apache/incubator-airflow

This post will describe how you can deploy Apache Airflow using the Kubernetes executor on Azure Kubernetes Service (AKS). It will also go into detail about registering a proper domain name for airflow running on HTTPS. To get the most out of this post basic knowledge of helm, kubectl and docker is advised as it the commands won't be explained into detail here. In short, Docker is currently the most popular container platform and allows you to isolate and pack self-contained environments. Kubernetes (accessible via the command line tool kubectl) is a powerful and comprehensive platform for orchestrating docker containers. Helm is a layer on top of kubectl and is an application manager for Kubernetes, making it easy to share and install complex applications on Kubernetes.

Getting started

To get started and follow along:

Clone the Airflow docker image repository
Clone the Airflow helm chart

Make a copy of ./airflow-helm/airflow.yaml to ./airflow-helm/airflow-local.yaml. We'll be modifying this file throughout this guide.

Airflow Install Windows

Kubernetes Executor on Azure Kubernetes Service (AKS)

The kubernetes executor for Airflow runs every single task in a separate pod. It does so by starting a new run of the task using the airflow run command in a new pod. The executor also makes sure the new pod will receive a connection to the database and the location of DAGs and logs.

AKS is a managed Kubernetes service running on the Microsoft Azure cloud. It is assumed the reader has already deployed such a cluster – this is in fact quite easy using the quick start guide. Note: this guide was written using AKS version 1.12.7.

The fernet key in Airflow is designed to communicate secret values from database to executor. If the executor does not have access to the fernet key it cannot decode connections. To make sure this is possible set the following value in airflow-local.yaml:

Use this to generate a fernet key.

This setting will make sure the fernet key gets propagated to the executor pods. This is done by the Kubernetes Executor in Airflow automagically.

Deploying with Helm

In addition to the airflow-helm repository make sure your kubectl is configured to use the correct AKS cluster (if you have more than one). Assuming you have a Kubernetes cluster called aks-airflow you can use the azure CLI or kubectl.

respectively. Note that the latter one only works if you've invoked the former command at least once.

Azure Postgres

Airflow Install Packages

To make full use of cloud features we'll be connected to a managed Azure Postgres instance. If you don't have one, the quick start guide is your friend. All state for Airflow is stored in the metastore. Choosing this managed database will also take care of backups, which is one less thing to worry about.

Now, the docker image used in the helm chart uses an entrypoint.sh which makes some nasty assumptions:

Airflow Installation In Ec2

It creates the AIRFLOWCORESQL_ALCHEMY_CONN value given the postgres host, port, user and password.
The issue is that it expects an unencrypted (no SSL) connection by default. Since this blog post uses an external postgres instance we must use SSL encryption.

The easiest solution to this problem is to modify the Dockerfile and completely remove the ENTRYPOINT and CMD line. This does involve creating your own image and pushing it to your container registry. The Azure Container Registry (ACR) would serve that purpose very well.

We can then proceed to create the user and database for Airflow using psql. The easiest way to do this is to login to the Azure Portal open a cloud shell and connect to the postgres database with your admin user. From here you can create the user, database and access rights for Airflow with

You can then proceed to set the following value (assuming your postgres instance is called posgresdbforairflow) in airflow-local.yaml:

Note the sslmode=require at the end, which tells Airflow to use an encrypted connection to postgres.

Since we use a custom image we have to tell this to helm. Set the following values in airflow-local.yaml:

Note the acr-auth pull secret. You can either create this yourself or – better yet – let helm take care of it. To let helm create the secret for you, set the following values in airflow-local.yaml:

The sqlalchemy connection string is also propagated to the executor pods. Like the fernet key, this is done by the Kubernetes Executor.

Persistent logs and dags with Azure Fileshare

Microsoft Azure provides a way to mount SMB fileshares to any Kubernetes pod. To enable persistent logging we'll be configuring the helm chart to mount an Azure File Share (AFS). Out of scope is setting up logrotate, this is highly recommended since Airflow (especially the scheduler) generates a LOT of logs. In this guide, we'll also be using an AFS for the location of the dags.

Set the following values to enable logging to a fileshare in airflow-local.yaml.

The name of shareName must match an AFS that you've created before deploying the helm chart.

Now everything for persistent logging and persistent dags has been setup.

This concludes all work with helm and airflow is now ready to be deployed! Run the following command from the path where your airflow-local.yaml is located:

The next step would be to exec -it into the webserver or scheduler pod and creating Airflow users. This is out of scope for this guide.

Learn Spark or Python in just one day

Develop Your Data Science Capabilities. **Online**, instructor-led on 23 or 26 March 2020, 09:00 - 17:00 CET.

FQDN with Ingress controller

Airflow is currently running under it's own service and IP in the cluster. You could go into the web server by port-forward-ing the pod or the service using kubectl. But much nicer is assigning a proper DNS name to airflow and making it reachable over HTTPS. Microsoft Azure has an excellent guide that explains all the steps needed to get this working. Everything below – up to the 'Chaoskube' section – is a summary of that guide.

Deploying an ingress controller

If your AKS cluster is configured without RBAC you can use the following command to deploy the ingress controller.

This will configure a publicly available IP address to an NGINX pod which currently points to nothing. We'll fix that. You can get this IP address become available by watching the services:

Configuring a DNS name

Using the IP address created by the ingress controller you can now register a DNS name in Azure. The following bash commands take care of that:

Make it HTTPS

Now let's make it secure by configuring a certificate manager that will automatically create and renew SSL certificates based on the ingress route. The following bash commands takes care of that:

Next, install letsencrypt to enable signed certificates.

After the script has completed you now have a DNS name pointing to the ingress controller and a signed certificate. The only step remaining to make airflow accessible is configuring the controller to make sure it points to the well hidden airflow web service. Create a new file called ingress-routes.yaml containing

Run

to install it.

Now airflow is accessible over HTTPS on https://yourairflowdnsname.yourazurelocation.cloudapp.azure.com

Cool!

Chaoskube

As avid Airflow users might have noticed is that the scheduler occasionally has funky behaviour. Meaning that it stops scheduling tasks. A respected – though hacky – solution is to restart the scheduler every now and then. The way to solve this in kubernetes is by simply destroying the scheduler pod. Kubernetes will then automatically boot up a new scheduler pod.

Enter chaoskube. This amazing little tool – which also runs on your cluster – can be configured to kill pods within your cluster. It is highly configurable to target any pod to your liking.

Using the following command you can specify it to only target the airflow scheduler pod.

Concluding

Using a few highly available Azure services and a little effort you've now deployed a scalable Airflow solution on Kubernetes backed by a managed Postgres instance. Airflow also has a fully qualified domain name and is reachable over HTTPS. The kubernetes executor makes Airflow infinitely scalable without having to worry about workers.

Check out our Apache Airflow course, that teaches you the internals, terminology, and best practices of working with Airflow, with
hands-on experience in writing an maintaining data pipelines.

Airflow Install

Airflow Daemons

Web Server

Scheduler

Executors/Workers

Install Airflow Windows

How do the Daemons work together?

Single Node Airflow Setup

Multi-Node (Cluster) Airflow Setup

Benefits

Higher Availability

Distributed Processing

Scaling Workers

Horizontally

Vertically

Scaling Master Nodes

Apache Airflow Cluster Setup Steps

Pre-Requisites

Steps

Additional Documentation

Getting started

Airflow Install Windows

Kubernetes Executor on Azure Kubernetes Service (AKS)

Deploying with Helm

Azure Postgres

Airflow Install Packages

Airflow Installation In Ec2

Persistent logs and dags with Azure Fileshare

Learn Spark or Python in just one day

FQDN with Ingress controller

Deploying an ingress controller

Configuring a DNS name

Make it HTTPS

Chaoskube

Concluding

Author

Archives

Categories