MLlib - ML Library for Spark

  • MLlib is Spark's machine learning library. It integrates closely with Spark and supports distributed training (note: not every model can be trained in a distributed fashion).

  • It has algorithms for classification, regression, clustering and collaborative filtering.

  • Combined with scalable data manipulation using Spark DataFrames, it can be a good entry point into scalable machine learning, even without a dedicated team of ML engineers and data scientists.

  • Have a look at the documentation to learn more about these capabilities.

Our Goals

  • We will first see how to use the SVD model from the Surprise library (surpriselib) to train a collaborative filtering model.

  • After that, we will follow the example from https://spark.apache.org/docs/latest/ml-collaborative-filtering.html that essentially builds the same model, but using MLlib instead.

  • We will not be able to see the full power of the MLlib algorithms on a single-driver, zero-worker cluster with small data (MovieLens-100k is small!), but the exercise should still familiarize you with the MLlib library.

Surpriselib Matrix Factorization Model

The accompanying notebook walks through the Surprise SVD model.
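
Separately from the notebook, the kind of model that Surprise's SVD fits can be sketched in a few lines of plain Python: a rating is predicted as a global mean plus a user bias, an item bias, and a dot product of latent factors, and the parameters are learned by stochastic gradient descent. This is only a minimal sketch, not Surprise's implementation. The function names (train_mf, predict, rmse), the hyperparameter values, and the toy ratings are all assumptions made for illustration.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    # Small random latent factors for every user and item.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # Gradient step on the regularized squared error for this rating.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, u, i):
    mu, bu, bi, p, q = model
    # Unknown users or items contribute no bias and no factor term.
    dot = sum(a * b for a, b in zip(p.get(u, []), q.get(i, [])))
    return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot


def rmse(model, ratings):
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


# Hypothetical toy data: (user, item, rating) triples.
toy_ratings = [
    ("alice", "m1", 5.0), ("alice", "m2", 3.0), ("alice", "m3", 1.0),
    ("bob", "m1", 4.0), ("bob", "m3", 1.0),
    ("carol", "m2", 2.0), ("carol", "m3", 5.0),
]

model = train_mf(toy_ratings)
# Baseline: always predict the global mean rating.
baseline_rmse = math.sqrt(
    sum((r - model[0]) ** 2 for _, _, r in toy_ratings) / len(toy_ratings)
)
print("baseline RMSE:", baseline_rmse)
print("training RMSE:", rmse(model, toy_ratings))
```

The comparison here is on the training ratings only, so it shows that the model fits, not that it generalizes. On a real dataset such as MovieLens-100k the error would be measured on a held-out split.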

MLlib Matrix Factorization Model

The accompanying notebook walks through the MLlib version of the model.
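
The Spark collaborative filtering example referenced in the goals learns the latent factors with alternating least squares (ALS), exposed in PySpark as pyspark.ml.recommendation.ALS with parameters such as rank, maxIter and regParam. The pure-Python sketch below only illustrates the alternating idea on a single machine: hold the item factors fixed and solve a small ridge regression per user, then swap roles. It is not MLlib's implementation, it uses a plain ridge penalty, and every name in it (solve, update_side, train_als, objective) as well as the toy ratings are assumptions made for illustration.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def update_side(observed, other, rank, reg):
    """Re-fit one side's factors while the other side is held fixed.

    observed maps each id on this side to a list of (other_id, rating) pairs.
    """
    new = {}
    for key, pairs in observed.items():
        # Normal equations of ridge regression: (V^T V + reg * I) x = V^T r.
        a = [[reg if r == c else 0.0 for c in range(rank)] for r in range(rank)]
        b = [0.0] * rank
        for other_id, rating in pairs:
            v = other[other_id]
            for r in range(rank):
                b[r] += rating * v[r]
                for c in range(rank):
                    a[r][c] += v[r] * v[c]
        new[key] = solve(a, b)
    return new


def train_als(ratings, rank=2, max_iter=10, reg=0.1, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    item_f = {i: [rng.uniform(0.1, 1.0) for _ in range(rank)] for i in by_item}
    user_f = {}
    for _ in range(max_iter):
        user_f = update_side(by_user, item_f, rank, reg)  # items fixed
        item_f = update_side(by_item, user_f, rank, reg)  # users fixed
    return user_f, item_f


def objective(user_f, item_f, ratings, reg):
    """Regularized squared error that each ALS half-step minimizes for one side."""
    sq_err = sum(
        (r - sum(a * b for a, b in zip(user_f[u], item_f[i]))) ** 2
        for u, i, r in ratings
    )
    norms = sum(x * x for f in (user_f, item_f) for vec in f.values() for x in vec)
    return sq_err + reg * norms


# Hypothetical toy data: (user, item, rating) triples.
toy_ratings = [
    ("alice", "m1", 5.0), ("alice", "m2", 3.0), ("alice", "m3", 1.0),
    ("bob", "m1", 4.0), ("bob", "m3", 1.0),
    ("carol", "m2", 2.0), ("carol", "m3", 5.0),
]

for n_iter in (1, 10):
    uf, itf = train_als(toy_ratings, max_iter=n_iter)
    print(n_iter, "iterations, objective:", objective(uf, itf, toy_ratings, 0.1))
```

Each per-user (or per-item) solve depends only on that user's ratings and the fixed factors of the other side, so the solves are independent of one another. That independence is what makes ALS a natural fit for distributed execution.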

What Next?

  • We have barely scratched the surface of what Spark, PySpark and MLlib can do in terms of scale. These tools, along with cloud services such as Google Cloud Storage, AWS S3 and BigQuery, form the key components of ML solutions in a variety of industries.

  • There are always alternative ways to achieve the same result: e.g., distributed processing using tools such as Dataflow/Apache Beam or custom software, or using other vendors such as DigitalOcean or Rackspace. But the general patterns and considerations remain similar.

  • The key component of a Spark deployment, namely the underlying cluster, deserves a careful look, as overprovisioning or underprovisioning it can have a non-trivial impact on one’s work.

  • Eventually the pipelines above will be integrated with a workflow management tool such as Airflow (Databricks already has MLflow integration) for automation and reliability.