Basics
Python
- We will be predominantly concerned with the Python ecosystem
- A big advantage is that local development can be easily moved to the cloud and/or a scalable on-prem solution.
- Many companies use Python to start data science projects in-house (via fresh recruits, interns, etc.)
- Python has some relatively easy ways to access databases
- Big data platforms such as Spark have great Python bindings
- E.g., pandas DataFrames and Spark DataFrames

- The latest models (deep learning, pre-trained) are built in the Python ecosystem
- Many, many useful libraries: pandas, matplotlib, flask, …
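To illustrate the point about database access, here is a minimal sketch using Python's built-in sqlite3 module (the table name and ratings data are invented for the example; a real project would point at an actual database):

```python
import sqlite3

# Create an in-memory database and a toy table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0)],
)

# Query it back with plain SQL.
rows = conn.execute(
    "SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id ORDER BY movie_id"
).fetchall()
print(rows)  # [(10, 4.5), (20, 3.5)]
conn.close()
```

The same pattern extends to client libraries for other databases (e.g., psycopg2 or SQLAlchemy), and pandas can load a query directly into a DataFrame via read_sql.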
Our Objective
- Learn the patterns, not the specific tools
Deployment Targets
- Local machines
- On-prem or self-hosted machines (needs DevOps skills)
- Managed cloud
- Heroku (PaaS)
- Azure
- GCP
- AWS (IaaS)
 
- The decision to deploy on one versus the other depends on:
- skills
- business need
- internal vs external
- scale, reliability, security
- costs
- ease of deployment
 
Local Deployments are Hard
- Need to learn linux security
- Need to learn how to manage access
- Need to learn backups
- Need to learn hot switching / reliability
Cloud Deployments are not Easy
- Also need to learn a complex ecosystem
- Vendor lock-in (for successful businesses, this is not an issue)
Python development can happen:
- In text editors (e.g., Sublime Text)
- In IDEs (e.g., PyCharm or VS Code)
- In Jupyter notebooks and variants (Google Colab, Databricks notebooks)
- Vanilla notebooks do not support collaboration out of the box
 
Part 1: Setting up Jupyter access on a VPS
- We will use Vultr, but all steps are vendor agnostic. Alternatives include DigitalOcean, AWS EC2, and Google Cloud, as well as hosted options such as Google Colab and other vendors.
- SSH passwordless access is set up.
- Next, we set up a basic firewall for security.
- This is followed by installing conda.
- (Optional) To run the Jupyter server uninterrupted, we will run it within a screen session.
- We will access the server and notebooks on our local browser using SSH tunneling.
Part 2: Preparing an ML Model
- We will show how data is accessed and how the model is trained (this should be familiar to you).
- In particular, we will look at the movie recommendation problem.
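As a concrete (and deliberately tiny) sketch of the recommendation setup, here is user-based collaborative filtering in pure Python. The ratings dictionary, user names, and movie names are all invented for illustration; a real pipeline would use pandas/scikit-learn or Spark:

```python
from math import sqrt

# Hypothetical user -> {movie: rating} data.
ratings = {
    "alice": {"toy_story": 5.0, "heat": 2.0, "up": 4.5},
    "bob":   {"toy_story": 4.5, "heat": 1.5},
    "carol": {"heat": 5.0, "alien": 4.0},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, k=1):
    """Score movies the user has not seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("bob"))  # ['up']
```

Bob is most similar to Alice (they agree on toy_story and heat), so Alice's highly rated "up" outranks Carol's "alien".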
 
- There are aspects of saving and loading models that become important in production. For instance, we would like models to be able to live across dev/staging/prod environments. For this, we use the notion of model persistence:
- Natively:
- For example, PyTorch has native save and load methods.
- The same is true for scikit-learn and a variety of other packages.
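The underlying pattern in all of these is serialization: fit a model, write it to disk, load it back elsewhere. A stdlib-only sketch with pickle (which scikit-learn's recommended joblib builds on); the MeanRatingModel here is a made-up stand-in for a real trained model, and the same shape applies to torch.save/torch.load or joblib.dump/joblib.load:

```python
import os
import pickle
import tempfile

class MeanRatingModel:
    """A stand-in for a trained model: predicts the global mean rating."""
    def __init__(self, observed_ratings):
        self.mean = sum(observed_ratings) / len(observed_ratings)

    def predict(self):
        return self.mean

# "Train" the toy model.
model = MeanRatingModel([4.0, 3.5, 5.0])

# Persist it to disk, then load it back as a new object.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict())  # 4.166666666666667
```

Note that plain pickle ties the artifact to the exact class definitions and library versions available at load time, which is precisely the cross-environment fragility that motivates MLflow below.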
 
- Using MLflow:
- MLflow addresses the problem of moving models across different environments without incompatibilities (minor version numbers, OS, etc.), among other things.
- See these links for more information: https://pypi.org/project/mlflow/ and mlflow.org