Basics
Python
- We will be predominantly concerned with the Python ecosystem
- A big advantage is that local development can be easily moved to the cloud or to a scalable on-prem solution.
- Many companies use Python to start data science projects in-house (via fresh recruits, interns, etc.)
- Python has some relatively easy ways to access databases
- Big data platforms such as Spark have great Python bindings
- E.g., Pandas dataframe and Spark dataframe
- The latest models (deep learning, pre-trained) are built in the Python ecosystem
- Many useful libraries: pandas, matplotlib, Flask, …
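As a small illustration of the dataframe idiom mentioned above, here is a minimal pandas sketch; the Spark conversion line is shown only as a comment, since it assumes a running SparkSession:

```python
import pandas as pd

# Build a small pandas DataFrame locally; the same tabular idiom
# carries over to Spark DataFrames when the data outgrows one machine.
pdf = pd.DataFrame({"user_id": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})
mean_rating = pdf["rating"].mean()

# With pyspark installed and a SparkSession named `spark`, the local
# frame converts directly (shown as a comment only):
# sdf = spark.createDataFrame(pdf)
print(mean_rating)
```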
Our Objective
- Learn the patterns, not the specific tools
Deployment Targets
- Local machines
- On-prem or self-hosted machines (needs DevOps skills)
- Managed cloud
- Heroku (PaaS)
- Azure
- GCP
- AWS (IaaS)
- The decision to deploy on one versus another depends on
- skills
- business need
- internal vs external
- scale, reliability, security
- costs
- ease of deployment
Local Deployments are Hard
- Need to learn Linux security
- Need to learn how to manage access
- Need to learn backups
- Need to learn hot switching / reliability
Cloud Deployments are not Easy
- Also need to learn a complex ecosystem
- Vendor lock-in (for successful businesses, this is not an issue)
Python development can happen:
- In text editors (e.g., sublime-text)
- In IDEs (e.g., Pycharm or VSCode)
- In Jupyter notebooks and variants (Google Colab, Databricks notebooks)
- vanilla notebooks do not natively support collaboration
Part 1: Setting up Jupyter access on a VPS
- We will use Vultr, but all steps are vendor agnostic. Alternatives include DigitalOcean, AWS EC2, and Google Cloud; Google Colab and other hosted vendors are also options.
- SSH passwordless access is set up.
- Next, we set up a basic firewall for security.
- This is followed by installing conda.
- (Optional) To keep the Jupyter server running uninterrupted, we will run it within a screen session.
- We will access the server and notebooks on our local browser using SSH tunneling.
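The tunneling step above can be sketched as a small command builder. This is a hedged illustration: the user name, host, and ports are placeholders, and it assumes an OpenSSH client is available on the local machine:

```python
# Sketch of the SSH tunnel used to reach a remote Jupyter server from a
# local browser. User, host, and ports below are placeholders.
def build_tunnel_cmd(user, host, remote_port=8888, local_port=8888):
    """Return the argv for `ssh -N -L <local>:localhost:<remote> user@host`."""
    return [
        "ssh",
        "-N",  # no remote command; open the tunnel only
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]

cmd = build_tunnel_cmd("ubuntu", "my-vps.example.com")
# subprocess.run(cmd) would open the tunnel; then browse http://localhost:8888
print(" ".join(cmd))
```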
Part 2: Preparing an ML Model
We will show how data is accessed, and how the model is trained (this should be familiar to you).
- In particular, we will look at the movie recommendation problem.
There are aspects of saving and loading models that become important in production. For instance, we would like models to be able to move across dev/staging/prod environments. For this, we introduce the notion of model persistence.
Natively:
- For example, PyTorch has native save and load methods.
- Same is the case for scikit-learn and a variety of other packages.
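A minimal persistence round trip can be sketched with the standard library alone. The "model" below is a toy stand-in: scikit-learn estimators persist the same way (via pickle or joblib), and PyTorch offers `torch.save`/`torch.load`:

```python
import os
import pickle
import tempfile

# Toy stand-in for a trained model; scikit-learn estimators can be
# pickled the same way, and PyTorch has torch.save / torch.load.
model = {"weights": [0.2, 0.8], "bias": 0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)       # save in the dev environment

with open(path, "rb") as f:
    restored = pickle.load(f)   # load in staging/prod

print(restored == model)
```

Note that plain pickling ties the artifact to compatible library versions on the loading side, which is exactly the gap MLflow aims to close.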
Using MLflow:
- MLflow addresses the problem of moving models across different environments without incompatibility issues (minor version mismatches, OS differences, etc.), among other things.
- See these links for more information: https://pypi.org/project/mlflow/ and mlflow.org