Basics

Python

  • We will be predominantly concerned with the Python ecosystem
  • A big advantage is that development on a local system can easily be moved to the cloud and/or a scalable on-prem solution.
  • Many companies use Python to start data science projects in-house (via fresh recruits, interns, etc.)
  • Python has relatively easy ways to access databases (a sketch follows this list)
  • Big data platforms such as Spark have great Python bindings
    • e.g., moving between pandas DataFrames and Spark DataFrames
  • The latest models (deep learning, pre-trained) are built in the Python ecosystem
  • Many, many useful libraries: pandas, matplotlib, Flask, …
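
As a rough sketch of the database and Spark points above (the SQLite file, table name, and app name below are made-up placeholders; pandas and pyspark are assumed to be installed):

    import sqlite3

    import pandas as pd
    from pyspark.sql import SparkSession

    # Read a table from a local SQLite database into a pandas DataFrame.
    conn = sqlite3.connect("ratings.db")                    # hypothetical file
    pdf = pd.read_sql_query("SELECT * FROM ratings", conn)  # hypothetical table
    conn.close()

    # Hand the data to Spark (and back) once it needs to scale out.
    spark = SparkSession.builder.appName("demo").getOrCreate()
    sdf = spark.createDataFrame(pdf)  # pandas DataFrame -> Spark DataFrame
    pdf_again = sdf.toPandas()        # Spark DataFrame -> pandas DataFrame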

Our Objective

  • Learn the patterns, not the specific tools

Deployment Targets

  • Local machines
  • On-prem or self-hosted machines (needs DevOps skills)
  • Managed cloud
    • Heroku (PaaS)
    • Azure
    • GCP
    • AWS (IaaS)
  • The decision to deploy on one versus another depends on:
    • skills
    • business need
    • internal vs external
    • scale, reliability, security
    • costs
    • ease of deployment

Local Deployments are Hard

  • Need to learn Linux security
  • Need to learn how to manage access
  • Need to learn backups
  • Need to learn hot switching / reliability

Cloud Deployments are not Easy

  • Also need to learn a complex ecosystem
  • Vendor lock-in (for successful businesses, this is not an issue)

Aside: Software Tools

Python development can happen:

  • In text editors (e.g., Sublime Text)
  • In IDEs (e.g., PyCharm or VS Code)
  • In Jupyter notebooks and variants (Google Colab, Databricks notebooks)
    • the vanilla notebook does not support real-time collaboration out of the box

Part 1: Setting up Jupyter access on a VPS

  • We will use Vultr, but all steps are vendor agnostic. Alternatives include DigitalOcean, AWS EC2, and Google Cloud, as well as hosted options such as Google Colab.
  • First, passwordless SSH access is set up (e.g., with ssh-keygen and ssh-copy-id).
  • Next, we set up a basic firewall for security (e.g., ufw on Ubuntu).
  • This is followed by installing conda.
  • (Optional) To keep the Jupyter server running uninterrupted, we will run it within a screen session.
  • We will access the server and notebooks in our local browser using SSH tunneling (e.g., ssh -N -L 8888:localhost:8888 user@server); a Python-based alternative is sketched after this list.
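
The ssh command above is all the tunneling requires; purely as a Python-flavored alternative sketch, the third-party sshtunnel package can open the same kind of tunnel (the hostname, username, and key path below are hypothetical):

    from sshtunnel import SSHTunnelForwarder

    # Forward local port 8888 to port 8888 on the VPS, where Jupyter listens.
    tunnel = SSHTunnelForwarder(
        ("my-vps.example.com", 22),   # hypothetical server address
        ssh_username="ubuntu",        # hypothetical user
        ssh_pkey="~/.ssh/id_rsa",     # the key from the passwordless-SSH step
        remote_bind_address=("127.0.0.1", 8888),
        local_bind_address=("127.0.0.1", 8888),
    )
    tunnel.start()
    print("Jupyter is now reachable at http://localhost:8888")
    # ... browse the notebooks, then close the tunnel ...
    tunnel.stop()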

Part 2: Preparing an ML Model

  • We will show how data is accessed and how the model is trained (this should be familiar to you).
    • In particular, we will look at the movie recommendation problem.
  • There are aspects of saving and loading models that become important in production. For instance, we would like models to be able to live across dev/staging/prod environments. For this, we use the notion of model persistence.
    • Natively (see the first sketch below):
      • For example, PyTorch has native save and load methods.
      • The same is the case for scikit-learn and a variety of other packages.
    • Using MLflow (see the second sketch below):
      • MLflow addresses the problem of moving models across different environments without incompatibilities (minor version mismatches, OS differences, etc.), among other things.
      • See these links for more information: https://pypi.org/project/mlflow/ and mlflow.org
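
A rough sketch of the native route, with stand-in models instead of a real recommender (the tiny models and file names below are placeholders; torch and scikit-learn are assumed to be installed):

    import joblib
    import torch
    import torch.nn as nn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # PyTorch: save and restore the weights (the state dict).
    model = nn.Linear(10, 1)                    # stand-in for a trained model
    torch.save(model.state_dict(), "model.pt")
    restored = nn.Linear(10, 1)                 # loader must rebuild the same architecture
    restored.load_state_dict(torch.load("model.pt"))
    restored.eval()                             # switch to inference mode

    # scikit-learn: persist the whole fitted estimator with joblib.
    X, y = make_classification(n_samples=100, random_state=0)
    clf = LogisticRegression().fit(X, y)
    joblib.dump(clf, "clf.joblib")
    clf_restored = joblib.load("clf.joblib")
    print(clf_restored.predict(X[:5]))

Note that the state-dict route requires the loading side to reconstruct the exact architecture with a compatible library version; this is the kind of cross-environment friction MLflow aims to reduce.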
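
And a rough MLflow sketch along the same lines (the local tracking defaults and the artifact path "model" are assumptions for illustration):

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=100, random_state=0)
    model = LogisticRegression().fit(X, y)

    # Log the model; MLflow also records its environment (pip/conda
    # requirements), which is what lets it move across dev/staging/prod.
    with mlflow.start_run() as run:
        mlflow.sklearn.log_model(model, "model")

    # Reload by URI, possibly in a different environment pointing at the
    # same tracking/artifact store.
    model_uri = f"runs:/{run.info.run_id}/model"
    reloaded = mlflow.sklearn.load_model(model_uri)
    print(reloaded.predict(X[:5]))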