Spark-based Pipelines

Introduction

Spark

  • Spark lets you run data tasks (preprocessing, feature engineering, training) on multiple machines.
  • A core idea behind Spark is the notion of resilient distributed datasets (RDDs).
  • Using this core idea, Spark is able to manage fault tolerance and scale.
  • Spark also has an abstract data type called the DataFrame, similar to the dataframes in pandas and R.
  • This DataFrame interface sits on top of RDDs and allows for a more approachable specification of our tasks (see the sketch after this list).
  • As the name RDD implies, we are primarily looking at data volumes much bigger than what a typical single machine can handle.
    • A single machine today already has on the order of 100 GB of RAM and 12-32 cores for compute.
  • Spark is very popular across many industries and is used by thousands of companies.
    • It can be seen as a successor to Hadoop MapReduce (it integrates closely with the Hadoop ecosystem, e.g. YARN and HDFS, but is a separate Apache project).
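
A minimal sketch of the DataFrame interface, assuming a local Spark installation; the app name, data, and column names here are made up for illustration.

    from pyspark.sql import SparkSession

    # Start a local SparkSession (on a real cluster the builder settings would differ).
    spark = SparkSession.builder.master("local[*]").appName("dataframe-intro").getOrCreate()

    # A small in-memory DataFrame; in practice the data would come from a distributed source.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        schema=["name", "age"],
    )

    # Familiar, pandas-like operations; Spark turns them into operations on the underlying RDDs.
    df.filter(df.age > 30).select("name", "age").show()

    spark.stop()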

PySpark

  • PySpark is the Python API for accessing Spark.
  • As much of the work in ML is done in Python, PySpark is a very useful tool for scaling to large data volumes.
  • And that scaling can be achieved without a lot of engineering effort (a minimal sketch follows this list).
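
A minimal sketch of moving between pandas and PySpark, assuming a local Spark installation; the column names and values are made up for illustration.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("pyspark-intro").getOrCreate()

    # Existing Python/ML code often starts from a pandas DataFrame ...
    pdf = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": [0, 1, 0]})

    # ... which can be handed to Spark as a distributed DataFrame ...
    sdf = spark.createDataFrame(pdf)
    sdf = sdf.withColumn("feature_sq", sdf.feature * sdf.feature)

    # ... and a (small) result can be pulled back into pandas when needed.
    print(sdf.toPandas())

    spark.stop()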

Spark vs Container Deployments

  • We have looked at both hosted and managed deployments of model training and serving via containerization/Docker, serverless techniques, Kubernetes, Airflow, etc.
  • The containers that we built have to run on underlying nodes in isolation; that is, the training code cannot spread across machines.
    • And we need that ability when the data to train on is very large.
  • Spark lets us span our tasks across nodes for both training and serving, which is a distinct advantage over containerized training pipelines in certain business scenarios.

Our Goals

  • We will go through the basics of PySpark DataFrames.
    • We will also learn about pandas user-defined functions (UDFs) and how they help us work with PySpark DataFrames and large data with ease (see the pandas UDF sketch after this list).
  • Use PySpark to design tasks and pipelines, taking deployment into consideration.
    • For instance, we will look at how to read data from AWS S3 into the Spark cluster, as well as how to write data from Spark back to S3 (see the S3 sketch after this list).
  • See how to use PySpark in the cloud (AWS/GCP).
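
A minimal sketch of a pandas UDF, assuming Spark 3.x (type-hint style); the column names and data are made up for illustration.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.master("local[*]").appName("pandas-udf-intro").getOrCreate()

    sdf = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], schema=["x"])

    # The UDF receives a pandas Series per batch and returns a pandas Series,
    # so ordinary pandas/NumPy code runs in parallel across the cluster.
    @pandas_udf("double")
    def squared(x: pd.Series) -> pd.Series:
        return x * x

    sdf.withColumn("x_squared", squared("x")).show()

    spark.stop()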
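
A minimal sketch of reading from and writing to S3, assuming the cluster has the s3a connector (hadoop-aws) and AWS credentials configured, as on EMR; the bucket and paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-io-intro").getOrCreate()

    # Read a Parquet dataset directly from S3 into a distributed DataFrame.
    df = spark.read.parquet("s3a://my-bucket/input/features/")

    # ... preprocessing / feature engineering would go here ...

    # Write the result back to S3, partitioned across the cluster's workers.
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/features_processed/")

    spark.stop()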