Data Science Workflows

Introduction

  • In data science work streams, batch pipelines touch varied data sources (databases, warehouses, data lakes) and carry out feature generation, imputation, exploration, and many other tasks, all the way to producing trained model artifacts.

  • While doing so, we think of the process from start to end as blocks that can be chained in a sequence (or, more generally, arranged as a directed acyclic graph, or DAG), as sketched below.
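
  • As a minimal sketch (assuming Python 3.9+ and hypothetical block names such as extract, build_features, impute, and train), such a pipeline can be expressed as plain functions plus a dependency map and run in topological order:

    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    # Hypothetical pipeline blocks; real ones would read and write actual data.
    def extract():
        print("pull raw data from databases / warehouse / lake")

    def build_features():
        print("generate features from the raw data")

    def impute():
        print("fill in missing values")

    def train():
        print("fit the model and save the trained artifact")

    # The DAG: each block maps to the set of blocks it depends on.
    dag = {
        "extract": set(),
        "build_features": {"extract"},
        "impute": {"build_features"},
        "train": {"impute"},
    }

    blocks = {
        "extract": extract,
        "build_features": build_features,
        "impute": impute,
        "train": train,
    }

    # Execute the blocks in an order that respects every dependency.
    for name in TopologicalSorter(dag).static_order():
        blocks[name]()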

  • Some desirable properties we want from model pipelines are:

    • ability to manage multiple pipelines
    • ability to run blocks on a schedule (nightly, hourly)
    • ability to detect when an upstream block has failed or not finished, and to fix it quickly (reliability is important)
  • Ultimately we would like to manage pipelines without much manual work.

  • Workflow tools address these gaps. They let the user:

    • specify the DAG easily
    • ensure the dependencies for each block are met
    • schedule blocks to run automatically
  • Further, these tools also manage resources (compute, storage, etc.) and monitor runs to achieve these goals.

  • Example tools: Airflow, as well as simpler setups built from cron plus Kubernetes (both explored below); a short Airflow sketch follows.
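
  • As a rough sketch of how Airflow (explored later in these notes) expresses these ideas, the snippet below declares a small DAG with a daily schedule and one dependency edge. It assumes Airflow 2.x; the dag_id, task names, start date, and schedule are illustrative, not a prescribed setup.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Illustrative callables; in practice these would invoke the real pipeline blocks.
    def build_features():
        print("generate features")

    def train_model():
        print("train the model and save the artifact")

    with DAG(
        dag_id="model_pipeline",            # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),    # illustrative start date
        schedule_interval="@daily",         # run nightly, per the scheduling goal above
        catchup=False,
    ) as dag:
        features = PythonOperator(task_id="build_features", python_callable=build_features)
        train = PythonOperator(task_id="train_model", python_callable=train_model)

        features >> train  # dependency edge: features must finish before training

  • The Airflow scheduler then takes care of launching tasks on schedule, retrying failures per the configured policy, and recording task state, which covers the scheduling and monitoring points above.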

TLDR

  • These tools are essential for managing the production machine learning lifecycle when multiple team members are involved.

Our Goals

  • Build a batch pipeline running in a container (sketched as a minimal entrypoint below)
  • Use cron and Kubernetes for scheduling
  • Explore Airflow
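
  • As a rough sketch of the first goal, the whole pipeline can be wrapped in a single command-line entrypoint that the container image runs; cron or a Kubernetes CronJob then only needs to start the container on a schedule. The stage names and the --stage flag here are assumptions for illustration, not a fixed interface.

    import argparse

    # Hypothetical stages; each would call into the real pipeline code.
    STAGES = {
        "features": lambda: print("building features"),
        "train": lambda: print("training the model"),
    }

    def main() -> None:
        parser = argparse.ArgumentParser(description="Batch pipeline entrypoint")
        parser.add_argument("--stage", choices=sorted(STAGES), default="train",
                            help="which pipeline block to run")
        args = parser.parse_args()
        STAGES[args.stage]()

    if __name__ == "__main__":
        main()

  • The container image would run this script as its command, and a cron entry or Kubernetes CronJob would start the container nightly or hourly, matching the scheduling goals above.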