Apache Airflow

  • While cron-based scheduling works well for simple cases, it becomes harder to manage when certain jobs fail and other scheduled jobs depend on their outputs.

  • Workflow tools help with resolving these types of dependencies.

  • They also allow for version control of objects beyond code.

  • These tools have additional capabilities, such as alerting team members when a task/job fails so that someone can fix it and even rerun it manually.

  • It is beneficial for the whole organization if different teams can use the same tool:

    • data engineers doing ETL jobs,
    • data scientists doing model training jobs,
    • analysts doing reporting jobs, etc.
  • We will go through Apache Airflow as an example workflow tool. There are many others, as we mentioned before.

airflow1

Source: https://airflow.apache.org/

  • As noted above, a key benefit of Airflow is that it allows us to describe an ML pipeline in code (and in Python!).

Airflow Basics

  • Airflow works with graphs (specifically, directed acyclic graphs or DAGs) that relate tasks to each other and describe their ordering.

  • Each node in the DAG is a task; an incoming arrow from another task means that task is an upstream dependency (see the minimal sketch below).
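
  • For intuition, here is a minimal sketch (illustrative only, with made-up task names; it assumes the Airflow 1.10.x version used in these notes) of a three-task DAG in which two tasks are upstream of a third:

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.utils.dates import days_ago

    # a toy DAG: extract and clean both feed into train
    with DAG('toy_dag', start_date=days_ago(1), schedule_interval=None) as dag:
        extract = DummyOperator(task_id='extract')
        clean = DummyOperator(task_id='clean')
        train = DummyOperator(task_id='train')

        # incoming arrows into train: extract and clean are its upstream dependencies
        extract >> train
        clean >> train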

  • Let's install the airflow package and get a server running. From the quickstart page:

    # airflow needs a home, ~/airflow is the default,
    # but you can lay foundation somewhere else if you prefer
    # (optional)
    export AIRFLOW_HOME=~/airflow
    
    # install from pypi using pip
    pip install apache-airflow
    
    # initialize the database
    airflow initdb
    
    # start the web server, default port is 8080
    airflow webserver -p 8080
    
    # start the scheduler
    airflow scheduler
    
    # visit localhost:8080 in the browser and enable the example dag in the home page
    
  • For instance, when you start the webserver, you should see output similar to the one below:

    (datasci-dev) ttmac:lec05 theja$ airflow webserver -p 8080
    ____________       _____________
    ____    |__( )_________  __/__  /________      __
    ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
    ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
    _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
    [2020-09-24 12:55:50,012] {__init__.py:50} INFO - Using executor SequentialExecutor
    [2020-09-24 12:55:50,012] {dagbag.py:417} INFO - Filling up the DagBag from /Users/theja/airflow/dags
    /Users/theja/miniconda3/envs/datasci-dev/lib/python3.7/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
    category=PendingDeprecationWarning)
    Running the Gunicorn Server with:
    Workers: 4 sync
    Host: 0.0.0.0:8080
    Timeout: 120
    Logfiles: - -
    .
    .
    (truncated)
    .
    .
    
  • Similarly, when the scheduler is started, you should see:

    (datasci-dev) ttmac:lec05 theja$ airflow scheduler
    ____________       _____________
    ____    |__( )_________  __/__  /________      __
    ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
    ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
    _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
    [2020-09-24 12:57:27,736] {__init__.py:50} INFO - Using executor SequentialExecutor
    [2020-09-24 12:57:27,774] {scheduler_job.py:1367} INFO - Starting the scheduler
    [2020-09-24 12:57:27,775] {scheduler_job.py:1375} INFO - Running execute loop for -1 seconds
    [2020-09-24 12:57:27,775] {scheduler_job.py:1376} INFO - Processing each file at most -1 times
    [2020-09-24 12:57:27,775] {scheduler_job.py:1379} INFO - Searching for files in /Users/theja/airflow/dags
    [2020-09-24 12:57:27,785] {scheduler_job.py:1381} INFO - There are 25 files in /Users/theja/airflow/dags
    [2020-09-24 12:57:27,785] {scheduler_job.py:1438} INFO - Resetting orphaned tasks for active dag runs
    [2020-09-24 12:57:27,802] {dag_processing.py:562} INFO - Launched DagFileProcessorManager with pid: 5109
    [2020-09-24 12:57:27,812] {settings.py:55} INFO - Configured default timezone <Timezone [UTC]>
    [2020-09-24 12:57:27,829] {dag_processing.py:776} WARNING - Because we cannot use more than 1 thread (max_threads = 2) when using sqlite. So we set parallelism to 1.
    
  • Following this, we can go to localhost:8080 to see the following:

airflowweb1

  • When the above sequence of commands was run, Airflow created a config file in the ~/airflow folder. This config file has about 1000 lines.

    (datasci-dev) ttmac:~ theja$ cd airflow/
    (datasci-dev) ttmac:airflow theja$ less airflow.cfg
    [core]
    # The folder where your airflow pipelines live, most likely a
    # subfolder in a code repository. This path must be absolute.
    dags_folder = /Users/theja/airflow/dags
    
    # The folder where airflow should store its log files
    # This path must be absolute
    base_log_folder = /Users/theja/airflow/logs
    .
    .
    (truncated)
    .
    .
    (datasci-dev) ttmac:airflow theja$ wc -l airflow.cfg
    1073 airflow.cfg
    
  • Airflow manages information about pipelines through a database. By default it is SQLite (we could change this to something else, such as PostgreSQL or MySQL, if needed; see the excerpt below). It is initialized via the initdb command above.
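
  • For reference, the metadata database is configured in airflow.cfg via the sql_alchemy_conn setting in the [core] section; on this setup the default looks something like the excerpt below, and pointing it at another backend is a one-line change (the PostgreSQL URI is only an illustrative placeholder, not something we set up here):

    [core]
    # default: a SQLite file under AIRFLOW_HOME
    sql_alchemy_conn = sqlite:////Users/theja/airflow/airflow.db

    # hypothetical alternative backend (credentials/host are placeholders)
    # sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow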

  • The scheduler executes our tasks on workers (machines).

  • The webserver allows us to interact with the task scheduler and the database.

Anatomy of a Workflow Specification

  • The key idea is that we need to create a Python file to define the workflow DAG.

  • A key class that we will import is BashOperator, which allows us to run arbitrary shell commands (e.g., docker run image) as long as the dependencies are in place (e.g., the docker daemon, the local image registry, and the docker command line utility).

  • There is a set of default parameters that one should set for any workflow: for instance, who the owner of the workflow is, and whether they should be alerted by email.

  • We next create an instance of the DAG class.

  • We can define our command line task using the BashOperator. There are various kinds of operators available. We will also make it a node in our DAG.

  • If there are additional tasks, we relate them to each other (a bare-bones skeleton follows below).
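
  • Putting these pieces together, a bare-bones skeleton might look like the following (a sketch only, with hypothetical task names and a placeholder email; our actual specification is shown further below):

    from datetime import timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago

    # 1. default parameters shared by every task (owner, alerting, retries, ...)
    default_args = {
        'owner': 'airflow',
        'email': ['someone@example.org'],   # placeholder address
        'email_on_failure': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    # 2. the DAG instance itself
    dag = DAG(
        'skeleton-pipeline',                # hypothetical DAG id
        default_args=default_args,
        start_date=days_ago(1),
        schedule_interval=timedelta(days=1),
    )

    # 3. tasks are nodes in the DAG, defined via operators
    t1 = BashOperator(task_id='step_one', bash_command='echo step one', dag=dag)
    t2 = BashOperator(task_id='step_two', bash_command='echo step two', dag=dag)

    # 4. relate the tasks: t1 must finish before t2 starts
    t1 >> t2                                # equivalently, t2.set_upstream(t1)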

A very quick start using the tutorial workflow

  • Let's start by going through the tutorial in the Airflow documentation. After that, we will run our transient pipeline as a workflow through Airflow; the anatomy outlined above maps directly onto the tutorial's workflow specification.

  • We can execute the tutorial workflow by using the following command:

    (datasci-dev) ttmac:dags theja$ airflow backfill tutorial -s 2020-09-20 -e 2020-09-22
    
  • We can watch the progress in the browser by going to Browse -> Task Instances. You can see the progress snapshots below.

tut1

tut2

  • Here it shows the successful completion of all tasks.

tut3

Workflow Specification for Transient Training Pipeline

  • We can specify our workflow in the ~/airflow/dags folder as recommend.py, which will get picked up automatically by the scheduler.

  • If it is not automatically added, try running the following command:

    (datasci-dev) ttmac:dags theja$ airflow list_dags
    [2020-09-24 15:02:44,915] {__init__.py:50} INFO - Using executor SequentialExecutor
    [2020-09-24 15:02:44,915] {dagbag.py:417} INFO - Filling up the DagBag from /Users/theja/airflow/dags
    /Users/theja/miniconda3/envs/datasci-dev/lib/python3.7/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
    category=PendingDeprecationWarning)
    
    
    -------------------------------------------------------------------
    DAGS
    -------------------------------------------------------------------
    example_bash_operator
    example_branch_dop_operator_v3
    example_branch_operator
    example_complex
    example_external_task_marker_child
    example_external_task_marker_parent
    example_http_operator
    example_kubernetes_executor_config
    example_nested_branch_dag
    example_passing_params_via_test_command
    example_pig_operator
    example_python_operator
    example_short_circuit_operator
    example_skip_dag
    example_subdag_operator
    example_subdag_operator.section-1
    example_subdag_operator.section-2
    example_trigger_controller_dag
    example_trigger_target_dag
    example_xcom
    latest_only
    latest_only_with_trigger
    recommend-pipeline
    test_utils
    tutorial
    
  • Our Python script's contents are reproduced below (to check for syntax issues, just run the .py file on the command line):

    # [START import_module]
    from datetime import timedelta
    
    # The DAG object; we'll need this to instantiate a DAG
    from airflow import DAG
    # Operators; we need this to operate!
    from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago
    
    # [END import_module]
    
    # [START default_args]
    # These args will get passed on to each operator
    # You can override them on a per-task basis during operator initialization
    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': days_ago(2),
        'email': ['myself@theja.org'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
        # 'queue': 'bash_queue',
        # 'pool': 'backfill',
        # 'priority_weight': 10,
        # 'end_date': datetime(2016, 1, 1),
        # 'wait_for_downstream': False,
        # 'dag': dag,
        # 'sla': timedelta(hours=2),
        # 'execution_timeout': timedelta(seconds=300),
        # 'on_failure_callback': some_function,
        # 'on_success_callback': some_other_function,
        # 'on_retry_callback': another_function,
        # 'sla_miss_callback': yet_another_function,
        # 'trigger_rule': 'all_success'
    }
    # [END default_args]
    
    # [START instantiate_dag]
    dag = DAG(
        'recommend-pipeline',
        default_args=default_args,
        description='Run the transient training pipeline',
        schedule_interval=timedelta(days=1),
    )
    # [END instantiate_dag]
    
    t1 = BashOperator(
        task_id='docker-pipeline-run',
        bash_command='docker run recommend_pipeline',
        dag=dag,
    )
    
    
    # [START documentation]
    dag.doc_md = __doc__
    
    t1.doc_md = """\
    #### Transient Pipeline
    Downloads movielens-100k, trains a recommendation model and saves top 10 recommendations to Google BigQuery.
    """
    # [END documentation]
    
    
    t1
    # [END tutorial]
    
  • The task can be seen from the browser UI as well:

rec1

  • We can run this workflow by triggering it through the UI or by using the backfill argument.

    (datasci-dev) ttmac:dags theja$ airflow backfill recommend-pipeline -s 2020-09-01 -e 2020-09-01
    
  • We can verify that the task ran successfully in the browser.

rec2

  • We can also check that the update timestamp in BigQuery reflects the successful completion of the transient training pipeline.

rec2

Remarks

  • If there were other tasks, they could be specified similarly and related to each other in the script using the .set_upstream() method (there are other ways; we already saw one in the tutorial).
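
  • For instance, assuming two tasks t1 and t2 already defined in the same DAG, the following are equivalent ways of declaring that t1 runs before t2:

    # three equivalent ways of making t1 an upstream dependency of t2
    t2.set_upstream(t1)
    t1.set_downstream(t2)
    t1 >> t2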

  • Instead of the BashOperator, we could also use the DockerOperator (we have not done this here; a sketch follows below).
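
  • As a sketch only (we did not run this; it assumes the Docker extras are installed, e.g. pip install 'apache-airflow[docker]', and that the recommend_pipeline image exists locally), the BashOperator task above could be replaced by something like:

    from airflow.operators.docker_operator import DockerOperator

    t1 = DockerOperator(
        task_id='docker-pipeline-run',
        image='recommend_pipeline',               # same local image as before
        api_version='auto',
        auto_remove=True,                         # remove the container when the task finishes
        docker_url='unix://var/run/docker.sock',  # talk to the local docker daemon
        dag=dag,
    )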

  • Next, we will see how to use a managed solution (Kubernetes) to run Airflow and our pipelines.