Scheduling
Airflow is the focus of this page
We always look for an efficient system to handle repetitive tasks with the least human intervention, whether a set of tasks has to run at a regular interval or in response to a trigger event.
So far, the cron job has been the usual choice on Linux-based systems. Airflow is one of the most efficient and powerful tools for scheduling purposes.
Airflow not only performs all the tasks of a cron job but also comes with a handy user interface for monitoring. Additionally, Airflow can apply conditions when we want to line up more than one task and the tasks are not mutually independent but depend on the completion of other tasks
Cron Job
is a time-based job scheduler in Unix-like computer operating systems
used to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals
automates system maintenance or administration—though its general-purpose nature makes it useful for things like downloading files from the Internet and downloading email at regular intervals
Airflow
Reference YouTube video link for a complete Airflow tutorial: here
Case Study: here
Features: fault tolerant (in-built distributed computing possible), monitoring, and alerting
It can perform all the tasks of a cron job on any machine
It allows us to programmatically author and schedule complex workflows, along with a beautiful UI for monitoring. It can also send an email about task execution and provides many other features
It makes data-pipeline creation, maintenance, and monitoring super easy
With the emergence of data science, the use cases of Airflow have increased even further, e.g. for scheduling model training and for cleaning and updating data
Airflow is super easy to adopt because workflows are implemented in the Python language
DAGs (Directed Acyclic Graphs):
A DAG is a series of tasks that you want to run as part of your workflow. Each task can be thought of as a single operation.
Airflow also enables us to specify the relationships between the tasks, any dependencies, and the order in which the tasks should run
A DAG is written in Python and saved as a .py file. The DAG_ID is used extensively by the tool to orchestrate the running of the DAGs.
An operator encapsulates the operation to be performed in each task in a DAG. Different types of operators are available in Airflow
BashOperator - executes a bash command
PythonOperator - calls an arbitrary Python function
EmailOperator - sends an email
SimpleHttpOperator - sends an HTTP request
MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
Sensor - waits for a certain time, file, database row, S3 key, etc…
You can also come up with a custom operator as per your need.
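To make this concrete, below is a minimal sketch of a DAG file with two tasks and one dependency. It assumes Airflow 2.x import paths (1.x used modules like airflow.operators.bash_operator), and the DAG ID, task IDs, and printed message are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    # Arbitrary Python callable executed by the PythonOperator
    print("Hello from Airflow!")


with DAG(
    dag_id="example_dag",            # the DAG_ID Airflow uses to identify this workflow
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # run once a day
    catchup=False,
) as dag:
    t1 = BashOperator(task_id="print_date", bash_command="date")
    t2 = PythonOperator(task_id="say_hello", python_callable=say_hello)

    # t2 runs only after t1 completes successfully
    t1 >> t2
```

Saving such a .py file in the dags folder is enough for the scheduler to pick it up and show it in the UI.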
Installation and Running Airflow
Implementation
Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration (no need to be intimidated by DAGs; implement one and you will get the idea)
Tasks and dependencies are defined in Python
Airflow manages the scheduling and execution
DAGs can run either on a defined schedule (e.g. hourly) or based on some event triggers
If you want, it can send an email or message in case of any error while running a scheduled job
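As a sketch of how the schedule and failure emails mentioned above could be configured: the cron expression, email address, and retry settings below are illustrative assumptions, and email_on_failure also needs SMTP settings in airflow.cfg.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "email": ["alerts@example.com"],   # hypothetical address
    "email_on_failure": True,          # mail out if a task fails
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="hourly_job",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 * * * *",     # cron syntax: every hour, on the hour
    catchup=False,
) as dag:
    BashOperator(task_id="run_script", bash_command="echo 'scheduled run'")
```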
Message Sender Scheduler
Objective: Get prices of 5 specific shares (TCS, INFOSYS, Reliance Industries, HDFC Bank, and ICICI Bank) from the moneycontrol website every day at 11:00 am and 5:00 pm. Scrape the data of the 5 shares and send the result as a message and an email to your personal mobile number and email ID (for scheduling, use Airflow)
Have a look at the linked code for the other elements of the implementation: code (a rough sketch of the scheduling side is also given below)
GitHub link for this application will be updated shortly
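Until then, here is a rough sketch of how the scheduling side could look. The parsing of moneycontrol pages and the SMS step are hypothetical placeholders (see the linked code for the real implementation), the email address is made up, and the EmailOperator needs SMTP settings in airflow.cfg.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

SHARES = ["TCS", "INFOSYS", "RELIANCE", "HDFCBANK", "ICICIBANK"]  # assumed tickers


def fetch_prices(**context):
    # Hypothetical scraping step: the real moneycontrol URLs and HTML parsing
    # live in the linked code and are not reproduced here.
    prices = {}
    for share in SHARES:
        prices[share] = "price parsed from the moneycontrol page"  # placeholder
    return prices  # the return value is pushed to XCom for downstream tasks


def send_sms(**context):
    prices = context["ti"].xcom_pull(task_ids="fetch_prices")
    # Hypothetical SMS step: plug in the SMS gateway/API of your choice here.
    print(f"Would send SMS with: {prices}")


with DAG(
    dag_id="share_price_alerts",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 11,17 * * *",  # 11:00 and 17:00 daily (scheduler timezone)
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="fetch_prices", python_callable=fetch_prices)
    sms = PythonOperator(task_id="send_sms", python_callable=send_sms)
    mail = EmailOperator(
        task_id="send_email",
        to="you@example.com",  # hypothetical address
        subject="Share prices",
        html_content="{{ ti.xcom_pull(task_ids='fetch_prices') }}",
    )

    scrape >> [sms, mail]   # SMS and email both wait for the scraping task
```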
Data pipeline using Airflow
Objective: Create 3 tasks and manage the order in which the three tasks run (see the sketch after this list)
Task 1: Read from one Elasticsearch (ES) index --> Process the Data --> Write to Another ES Index
Task 2: Clean the data of processed ES index
Task 3: Read from processed ES index --> Write to Redis Database
GitHub link for this application to be updated shortly
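Until then, the sketch below shows how the three tasks could be wired together and ordered. The index names, connection details, and the "processing"/"cleaning" logic are placeholder assumptions, and the exact search/index/delete_by_query keyword arguments differ between versions of the elasticsearch Python client.

```python
from datetime import datetime

import redis
from airflow import DAG
from airflow.operators.python import PythonOperator
from elasticsearch import Elasticsearch

ES_HOST = "http://localhost:9200"                          # assumed
RAW_INDEX, PROCESSED_INDEX = "raw_data", "processed_data"  # assumed index names


def process_raw_index():
    # Task 1: read from one ES index, process, write to another ES index
    es = Elasticsearch(ES_HOST)
    hits = es.search(index=RAW_INDEX, query={"match_all": {}})["hits"]["hits"]
    for hit in hits:
        doc = hit["_source"]
        doc["processed"] = True  # placeholder "processing"
        es.index(index=PROCESSED_INDEX, document=doc)


def clean_processed_index():
    # Task 2: clean the processed index (placeholder: drop docs missing a field)
    es = Elasticsearch(ES_HOST)
    es.delete_by_query(
        index=PROCESSED_INDEX,
        query={"bool": {"must_not": {"exists": {"field": "processed"}}}},
    )


def load_into_redis():
    # Task 3: read the processed index and write each document into Redis
    es = Elasticsearch(ES_HOST)
    r = redis.Redis(host="localhost", port=6379)
    hits = es.search(index=PROCESSED_INDEX, query={"match_all": {}})["hits"]["hits"]
    for hit in hits:
        r.set(hit["_id"], str(hit["_source"]))


with DAG(
    dag_id="es_to_redis_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="process_raw_index", python_callable=process_raw_index)
    t2 = PythonOperator(task_id="clean_processed_index", python_callable=clean_processed_index)
    t3 = PythonOperator(task_id="load_into_redis", python_callable=load_into_redis)

    # enforce the order: Task 1 -> Task 2 -> Task 3
    t1 >> t2 >> t3
```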
A few things to keep in mind
Cluster (group) Airflow DAGs on the basis of:
the criticality of the jobs, the resource requirements of the jobs, data source, processing type, region of the job, etc.
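One way such grouping shows up in the DAG code itself is through DAG tags and resource pools; the sketch below assumes Airflow 2.x, and the tag names and the high_memory pool are made-up examples (pools themselves are created via the UI or CLI).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="critical_region_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    tags=["critical", "asia", "ingest"],  # makes filtering and grouping easy in the UI
) as dag:
    BashOperator(
        task_id="heavy_step",
        bash_command="echo 'resource-hungry work'",
        pool="high_memory",  # assumed pool that caps concurrent resource-heavy tasks
    )
```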
Jenkins
Since we are talking about automation here, it is worth mentioning software for automating the deployment of microservices and other applications. I will not be implementing it or covering it in detail as of now.
Jenkins is an automation server which helps automate the parts of software development related to building, testing, and deploying, facilitating continuous integration and continuous delivery (CI/CD)
References for Further Reading
Once you implement sample code for any of the DAGs in Airflow, you should have a pretty good idea about its use cases and how to use it; thereafter, the theoretical background can be strengthened through the two links below