Serving ML Models Using Web Servers

Model Serving

  • Sharing results with others (humans, web services, applications)
  • Batch approach: precompute predictions and dump them to a database (quite popular); see the sketch after this list
  • Real-time approach: send a feature vector to the service and get the prediction back immediately; the computation happens on demand
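
A minimal sketch of the batch approach, assuming a scikit-learn model already trained and saved as model.joblib, and a SQLite database with a features table (all file, table, and column names here are hypothetical):

    import joblib
    import pandas as pd
    import sqlite3

    # Load the previously trained model from disk
    model = joblib.load("model.joblib")

    # Read the batch of feature vectors to score
    conn = sqlite3.connect("app.db")
    features = pd.read_sql("SELECT id, f1, f2, f3 FROM features", conn)

    # Score the whole batch at once and dump the predictions to a table
    features["prediction"] = model.predict(features[["f1", "f2", "f3"]])
    features[["id", "prediction"]].to_sql("predictions", conn,
                                          if_exists="replace", index=False)
    conn.close()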

How to consume predictions from a prediction service?

  • Using web requests (e.g., using a JSON payload)
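
For example, a request and response might carry JSON payloads like these (the /predict endpoint and the field names are hypothetical):

    POST /predict
    {"features": [5.1, 3.5, 1.4, 0.2]}

    Response:
    {"prediction": "setosa"}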

How to output predictions?

  • We will set up a server to serve predictions
    • It will respond to web requests (GET, POST)
    • We pass in some inputs (an image, text, a vector of numbers) and get back some outputs, just like a function
    • The environment from which we pass inputs may be very different from the environment where the prediction happens (e.g., different hardware)

Our Objective

  • Use scikit-learn/Keras models with Flask, Gunicorn, and Heroku to set up a prediction server
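
As a rough sketch of the deployment pieces, assuming the Flask app lives in app.py and exposes an application object named app:

requirements.txt (packages Heroku installs for the server):

    flask
    gunicorn
    scikit-learn

Procfile (tells Heroku to start the app with Gunicorn):

    web: gunicorn app:app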

Part 1: Making API Calls

  • Using the requests module from a Jupyter notebook (a programmatic approach); see the sketch after this list
  • Alternatively, using curl or Postman (more versatile tools for ad-hoc testing)
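
A minimal sketch of a programmatic call with the requests module, assuming a prediction server is already running at localhost:5000 with a /predict route (the URL and payload shape are assumptions):

    import requests

    # Feature vector to score (the "features" field name is hypothetical)
    payload = {"features": [5.1, 3.5, 1.4, 0.2]}

    # POST the JSON payload to the prediction endpoint
    response = requests.post("http://localhost:5000/predict", json=payload)

    # The server replies with a JSON body containing the prediction
    print(response.json())

The equivalent call can be made from curl or Postman by POSTing the same JSON body to the same URL.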

Part 2: Simple Flask App

  • Flask uses function decorators to map routes (URLs) to handler functions.
  • Integrating the model with the app is relatively easy if the model can be loaded from disk; see the sketch below.
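
A minimal sketch of such a Flask app, assuming a scikit-learn model saved to disk as model.joblib (file, route, and field names are assumptions):

    import joblib
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Load the trained model from disk once, at startup
    model = joblib.load("model.joblib")

    # The decorator maps the /predict route to this handler function
    @app.route("/predict", methods=["POST"])
    def predict():
        # Read the JSON payload sent by the client
        payload = request.get_json()
        # scikit-learn expects a 2-D array: one row per sample
        prediction = model.predict([payload["features"]])
        return jsonify({"prediction": prediction.tolist()})

    if __name__ == "__main__":
        app.run(port=5000)

In production, Gunicorn serves the same app object (gunicorn app:app) instead of Flask's built-in development server.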