Basics
Python
- We will be predominantly concerned with the Python ecosystem
- A big advantage is that local development can be easily moved to the cloud and/or a scalable on-prem solution.
- Many companies use Python to start data science projects in-house (via fresh recruits, interns, etc.)
- Python has some relatively easy ways to access databases
- Big data platforms such as Spark have great Python bindings
- E.g., pandas DataFrames and Spark DataFrames

- The latest models (deep learning, pre-trained) are built in the Python ecosystem
- Many, many useful libraries: pandas, matplotlib, flask, …
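To illustrate the point about database access, here is a minimal sketch using Python's built-in sqlite3 module (the table name and ratings data are invented for the example; a real project would point at an actual database):

```python
import sqlite3

# Create an in-memory database and a toy table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0)],
)

# Query it back with plain SQL.
rows = conn.execute(
    "SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id ORDER BY movie_id"
).fetchall()
print(rows)  # [(10, 4.5), (20, 3.5)]
conn.close()
```

The same pattern extends to client libraries for other databases (e.g., psycopg2 or SQLAlchemy), and pandas can load a query directly into a DataFrame via read_sql.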
Our Objective
- Learn the patterns, not the specific tools
Deployment Targets
- Local machines
- On-prem or self-hosted machines (needs DevOps skills)
- Managed cloud
- Heroku (PaaS)
- Azure
- GCP
- AWS (IaaS)
 
- The decision to deploy on one versus the other depends on:
- skills
- business need
- internal vs external
- scale, reliability, security
- costs
- ease of deployment
 
Local Deployments are Hard
- Need to learn linux security
- Need to learn how to manage access
- Need to learn backups
- Need to learn hot switching / reliability
Cloud Deployments are not Easy
- Also need to learn a complex ecosystem
- Vendor lock-in (for successful businesses, this is not an issue)
Python development can happen:
- In text editors (e.g., Sublime Text)
- In IDEs (e.g., PyCharm or VS Code)
- In Jupyter notebooks and variants (Google Colab, Databricks notebooks)
- Vanilla notebooks do not support collaboration out of the box
 
Part 1: Setting up Jupyter access on a VPS
- We will use Vultr, but all steps are vendor agnostic. Alternatives include DigitalOcean, AWS EC2, and Google Cloud, as well as hosted options such as Google Colab and other vendors.
- SSH passwordless access is set up.
- Next, we set up a basic firewall for security.
- This is followed by installing conda.
- (Optional) To run the Jupyter server uninterrupted, we will run it within a screen session.
- We will access the server and notebooks on our local browser using SSH tunneling.
Part 2: Preparing an ML Model
- We will show how data is accessed and how the model is trained (this should be familiar to you).
- In particular, we will look at the movie recommendation problem.
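As a concrete (and deliberately tiny) sketch of the recommendation setup, here is user-based collaborative filtering in pure Python. The ratings dictionary, user names, and movie names are all invented for illustration; a real pipeline would use pandas/scikit-learn or Spark:

```python
from math import sqrt

# Hypothetical user -> {movie: rating} data.
ratings = {
    "alice": {"toy_story": 5.0, "heat": 2.0, "up": 4.5},
    "bob":   {"toy_story": 4.5, "heat": 1.5},
    "carol": {"heat": 5.0, "alien": 4.0},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, k=1):
    """Score movies the user has not seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("bob"))  # ['up']
```

Bob is most similar to Alice (they agree on toy_story and heat), so Alice's highly rated "up" outranks Carol's "alien".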
 
- There are aspects of saving and loading models that become important in production. For instance, we would like models to be able to live across dev/staging/prod environments. For this, we use the notion of model persistence:
- Natively:
- For example, PyTorch has native save and load methods.
- The same is true for scikit-learn and a variety of other packages.
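The underlying pattern in all of these is serialization: fit a model, write it to disk, load it back elsewhere. A stdlib-only sketch with pickle (which scikit-learn's recommended joblib builds on); the MeanRatingModel here is a made-up stand-in for a real trained model, and the same shape applies to torch.save/torch.load or joblib.dump/joblib.load:

```python
import os
import pickle
import tempfile

class MeanRatingModel:
    """A stand-in for a trained model: predicts the global mean rating."""
    def __init__(self, observed_ratings):
        self.mean = sum(observed_ratings) / len(observed_ratings)

    def predict(self):
        return self.mean

# "Train" the toy model.
model = MeanRatingModel([4.0, 3.5, 5.0])

# Persist it to disk, then load it back as a new object.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict())  # 4.166666666666667
```

Note that plain pickle ties the artifact to the exact class definitions and library versions available at load time, which is precisely the cross-environment fragility that motivates MLflow below.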
 
- Using MLflow:
- MLflow addresses the problem of moving models across different environments without incompatibilities (minor version numbers, OS, etc.), among other things.
- See these links for more information: https://pypi.org/project/mlflow/ and mlflow.org