Spark lets you run data tasks (preprocessing, feature engineering, training) on multiple machines.
A core idea behind Spark is the notion of resilient distributed datasets (RDDs).
Using RDDs, Spark is able to manage fault tolerance and scale.
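For a flavor of the RDD abstraction, here is a minimal sketch; the numbers, the local master setting, and the app name are purely illustrative:

```python
from pyspark import SparkContext

# Start a SparkContext; "local[*]" just means "use all local cores" here.
sc = SparkContext(master="local[*]", appName="rdd-demo")

# An RDD is an immutable, partitioned collection; if a partition is lost,
# Spark can recompute it from the lineage of transformations that produced it.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Transformations (map, filter) are lazy; the action (sum) triggers execution.
total = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0).sum()
print(total)

sc.stop()
```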
Spark also has an abstract data type called the DataFrame, similar to data frames in pandas and R.
This DataFrame interface sits on top of RDDs and allows for a more approachable specification of our tasks.
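A minimal sketch of the DataFrame interface; the toy rows, column names, and app name are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Toy rows; in practice the data would come from files or tables.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 41), ("carol", 29)],
    schema=["name", "age"],
)

# The API reads a lot like pandas/R, but each operation is planned and
# executed over distributed RDD partitions under the hood.
df.filter(df.age > 30).select("name").show()

# The underlying RDD is still reachable when needed.
print(df.rdd.getNumPartitions())

spark.stop()
```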
As the name RDD implies, we are primarily looking at data volumes much bigger than what a typical single machine can handle.
A typical single machine already has ~100 GB of RAM and 12-32 cores for compute.
Spark is very popular across multiple industries and is used by thousands of companies.
It can be seen as a successor to Hadoop MapReduce (technically it is part of the Hadoop ecosystem).
PySpark
PySpark is the Python API for accessing Spark.
Since much of the work in ML is done in Python, PySpark is a very useful tool for scaling to large data volumes.
That scaling can be achieved without a lot of engineering effort.
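As a rough sketch of what that looks like (the file path and the column names below are hypothetical), the same few lines run unchanged on a laptop or on a cluster:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pyspark-scaling-demo").getOrCreate()

# Hypothetical input; in a real pipeline this could be many gigabytes.
events = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# A typical preprocessing step expressed declaratively; Spark distributes the
# work across whatever executors the cluster provides.
daily_counts = (
    events
    .withColumn("date", F.to_date("timestamp"))
    .groupBy("user_id", "date")
    .agg(F.count("*").alias("n_events"))
)

daily_counts.show(5)
spark.stop()
```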
Spark vs Container Deployments
We have looked at both hosted and managed deployments of model training and serving via containerization (Docker), serverless techniques, Kubernetes, Airflow, etc.
The containers that we built have to run on their underlying nodes in isolation; that is, the training code cannot spread across machines.
We need that ability when the data to train on is very large.
Spark lets us span our tasks across nodes for both training and serving, which is a distinct advantage over containerized training pipelines in certain business scenarios.
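As a sketch, where the work runs is largely a configuration concern rather than a code change; the master URL and memory setting below are illustrative examples, not a recommended setup:

```python
from pyspark.sql import SparkSession

# The master URL decides where the work runs; the application code itself does
# not change. Both values below are examples only.
spark = (
    SparkSession.builder
    .appName("distributed-training-prep")
    # .master("spark://spark-master:7077")   # hypothetical standalone cluster
    .master("local[*]")                       # all cores on a single machine
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)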
Our Goals
We will go through the basics of PySpark DataFrames.
We will also learn about pandas user-defined functions (UDFs) and how they help us work with PySpark DataFrames and large data with ease.
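As a preview, here is a minimal pandas UDF sketch; the column name and the transformation are illustrative only:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (4.0,)], ["amount"])

# The function receives a pandas Series per batch of rows, so ordinary
# pandas/NumPy code runs on each executor instead of one Python call per row.
@pandas_udf("double")
def log1p_amount(amount: pd.Series) -> pd.Series:
    return np.log1p(amount)

df.withColumn("log_amount", log1p_amount("amount")).show()
spark.stop()
```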
We will use PySpark to design tasks and pipelines, and also take deployment into consideration.
For instance, we will look at how to have the Spark cluster read data from AWS S3, as well as how to write data from Spark back to S3.
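As a preview sketch, the bucket, prefixes, and columns below are hypothetical, and the cluster is assumed to have the hadoop-aws (s3a) connector and AWS credentials configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-io-demo").getOrCreate()

# Hypothetical bucket and prefixes; the cluster needs the s3a connector on the
# classpath and AWS credentials (e.g. an instance role or environment variables).
raw = spark.read.parquet("s3a://my-bucket/raw/events/")

# Placeholder transformation standing in for real feature engineering.
features = raw.select("user_id", "amount")

# Write the result back to S3 for downstream training or serving jobs.
features.write.mode("overwrite").parquet("s3a://my-bucket/features/events/")

spark.stop()
```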