Scheduling

Airflow is the focus of this page

We always look for an efficient system to handle repetitive tasks with minimal human intervention. Whenever a set of tasks needs to run at a regular interval or on some trigger event, we want a reliable way to schedule it.

Traditionally, cron jobs have served this purpose on Linux-based systems. Airflow is one of the most efficient and powerful tools available for scheduling.

Airflow not only performs everything a cron job can, but also comes with a handy user interface for monitoring. Additionally, Airflow can apply conditions: if we want to line up more than one task, and the tasks are not mutually independent but depend on the completion of other tasks, Airflow can express those dependencies.

Cron Job

  • is a time-based job scheduler in Unix-like computer operating systems

  • used to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals

  • automates system maintenance or administration—though its general-purpose nature makes it useful for things like downloading files from the Internet and downloading email at regular intervals
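
As an illustration (the script path here is hypothetical), a crontab entry such as `0 11,17 * * * /usr/bin/python3 /home/user/fetch_prices.py`, added via `crontab -e`, would run that script every day at 11:00 and 17:00.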

Airflow

  • Reference YouTube video link for a complete Airflow tutorial: here

  • Case Study: here

  • Features: fault tolerance (built-in distributed computing is possible), monitoring, and alerting

  • It can perform all the tasks of a cron job on any machine

  • It allows you to programmatically author and schedule complex workflows, along with a beautiful UI for monitoring. It also sends emails about task execution and provides many other features

  • It makes data-pipeline creation, maintenance, and monitoring super easy

  • With the emergence of data science, the use cases of Airflow have grown even further, e.g. scheduling model training, data cleaning, and data updates

  • Implementing Airflow workflows is super easy because they are written in Python (see the minimal DAG sketch after this list)
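
To make the idea concrete, here is a minimal sketch of an Airflow DAG, assuming Airflow 2.x and the PythonOperator; the dag_id, schedule, and callable below are illustrative placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # Placeholder task logic
    print("Hello from Airflow!")


# A minimal DAG with a single Python task that runs once a day
with DAG(
    dag_id="hello_airflow",              # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",          # cron expressions such as "0 11,17 * * *" also work
    catchup=False,                       # do not backfill runs missed before "now"
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```

Placing such a file in the configured dags/ folder makes the DAG appear in the UI, where it can be triggered manually or left to run on its schedule.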

Installation and Running Airflow
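
Installation details vary by version, but typically Airflow is installed with `pip install apache-airflow`, the metadata database is initialized with `airflow db init`, and the webserver and scheduler are started with `airflow webserver` and `airflow scheduler` (recent versions also offer `airflow standalone` to run everything locally for development).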

Implementation

Message Sender Scheduler

Objective: Get the prices of 5 specific shares (TCS, Infosys, Reliance Industries, HDFC Bank, and ICICI Bank) from the Moneycontrol website every day at 11:00 am and 5:00 pm. Scrape the data for the 5 shares and send a message and an email to your personal mobile number and email id (use Airflow for the scheduling)

Have a look at the following code for the other elements of the implementation: code
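
As a reference for the overall structure (not a complete implementation), below is a minimal sketch of such a DAG assuming Airflow 2.x; `fetch_share_prices` and `send_notifications` are hypothetical placeholders, and the actual Moneycontrol scraping and SMS/email logic is omitted:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_share_prices(**context):
    # Hypothetical placeholder: scrape the prices of the 5 shares here
    # (e.g. with requests/BeautifulSoup) and return them as a dict.
    shares = ["TCS", "INFOSYS", "RELIANCE", "HDFCBANK", "ICICIBANK"]
    return {symbol: None for symbol in shares}


def send_notifications(ti, **context):
    # Pull the prices produced by the previous task via XCom,
    # then call your SMS/email provider of choice with the message.
    prices = ti.xcom_pull(task_ids="fetch_share_prices")
    message = "\n".join(f"{symbol}: {price}" for symbol, price in prices.items())
    print(message)  # replace with the actual SMS/email calls


with DAG(
    dag_id="share_price_notifier",        # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 11,17 * * *",    # every day at 11:00 and 17:00
    catchup=False,
) as dag:
    fetch = PythonOperator(
        task_id="fetch_share_prices",
        python_callable=fetch_share_prices,
    )
    notify = PythonOperator(
        task_id="send_notifications",
        python_callable=send_notifications,
    )

    fetch >> notify
```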

The system/server needs to be up at the scheduled times to perform the task; otherwise, the task will run whenever the system (and the application) comes back up after the scheduled time

GitHub link for this application will be updated shortly

Data pipeline using Airflow

Objective: Create three tasks and manage their execution order

  • Task 1: Read from one Elasticsearch (ES) index --> process the data --> write to another ES index

  • Task 2: Clean the data in the processed ES index

  • Task 3: Read from the processed ES index --> write to the Redis database

Hint: If you have not read about or implemented anything related to Elasticsearch and the Redis database, you may want to read the two sections under the Database tab first (a sketch of the DAG wiring follows below)
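
Here is a minimal sketch of how the three tasks could be wired together and ordered within one DAG, assuming Airflow 2.x; the three callables are placeholders and the actual Elasticsearch/Redis logic is omitted:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_es_index():
    # Task 1 placeholder: read from the source ES index, process the data,
    # and write the result to the processed ES index.
    ...


def clean_processed_index():
    # Task 2 placeholder: clean the data in the processed ES index.
    ...


def load_into_redis():
    # Task 3 placeholder: read from the processed ES index and write to Redis.
    ...


with DAG(
    dag_id="es_to_redis_pipeline",      # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task_1 = PythonOperator(task_id="process_es_index", python_callable=process_es_index)
    task_2 = PythonOperator(task_id="clean_processed_index", python_callable=clean_processed_index)
    task_3 = PythonOperator(task_id="load_into_redis", python_callable=load_into_redis)

    # Enforce the order: Task 1 -> Task 2 -> Task 3
    task_1 >> task_2 >> task_3
```

The `>>` operator declares that a downstream task runs only after its upstream task has completed successfully, which is what enforces the required order here.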

GitHub link for this application to be updated shortly

A few things to keep in mind

  • Cluster (group) Airflow DAGs based on:

    • criticality of the jobs, resource requirements of the jobs, data source, processing type, region of the job, etc.

Jenkins

Since we are talking about automation here, it is worth mentioning software for automating the deployment of microservices and other applications. I will not be implementing it or talking about it in detail as of now.

Jenkins is an automation server that helps automate the parts of software development related to building, testing, and deploying, facilitating continuous integration and continuous delivery (CI/CD)

References for Further Reading

Once you implement sample code for any of the DAGs in Airflow, you should have a pretty good idea of its use cases and how to build them; thereafter, the theoretical background can be strengthened through the two links below
