Serving ML Models Using Web Servers

Model Serving

  • Sharing results with others (humans, web services, applications)
  • Batch approach: precompute predictions and dump them to a database (quite popular); see the sketch after this list
  • Real-time approach: send a feature vector to the service and get the prediction back immediately; the computation happens on demand
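
A minimal sketch of the batch approach, assuming a scikit-learn model already trained and saved as model.joblib, and a SQLite database with a features table (all file, table, and column names here are hypothetical):

    import joblib
    import pandas as pd
    import sqlite3

    # Load the previously trained model from disk
    model = joblib.load("model.joblib")

    # Read the batch of feature vectors to score
    conn = sqlite3.connect("app.db")
    features = pd.read_sql("SELECT id, f1, f2, f3 FROM features", conn)

    # Score the whole batch at once and dump the predictions to a table
    features["prediction"] = model.predict(features[["f1", "f2", "f3"]])
    features[["id", "prediction"]].to_sql("predictions", conn,
                                          if_exists="replace", index=False)
    conn.close()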

How to consume predictions from a prediction service?

  • Using web requests (e.g., using a JSON payload)
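
For example, a request and response might carry JSON payloads like these (the /predict endpoint and the field names are hypothetical):

    POST /predict
    {"features": [5.1, 3.5, 1.4, 0.2]}

    Response:
    {"prediction": "setosa"}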

How to output predictions?

  • We will set up a server to serve predictions
    • It will respond to web requests (GET, POST)
    • We pass in some inputs (an image, text, a vector of numbers) and get back some outputs, just like a function
    • The environment from which we pass inputs may be very different from the environment where the prediction happens (e.g., different hardware)

Our Objective

  • Use scikit-learn/Keras models with Flask, Gunicorn, and Heroku to set up a prediction server
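
As a rough sketch of the deployment pieces, assuming the Flask app lives in app.py and exposes an application object named app:

requirements.txt (packages Heroku installs for the server):

    flask
    gunicorn
    scikit-learn

Procfile (tells Heroku to start the app with Gunicorn):

    web: gunicorn app:app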

Part 1: Making API Calls

  • Using the requests module from a Jupyter notebook (a programmatic approach); see the sketch after this list
  • Alternatively, using curl or Postman (more versatile tools for ad-hoc testing)
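
A minimal sketch of a programmatic call with the requests module, assuming a prediction server is already running at localhost:5000 with a /predict route (the URL and payload shape are assumptions):

    import requests

    # Feature vector to score (the "features" field name is hypothetical)
    payload = {"features": [5.1, 3.5, 1.4, 0.2]}

    # POST the JSON payload to the prediction endpoint
    response = requests.post("http://localhost:5000/predict", json=payload)

    # The server replies with a JSON body containing the prediction
    print(response.json())

The equivalent call can be made from curl or Postman by POSTing the same JSON body to the same URL.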

Part 2: Simple Flask App

  • Flask uses function decorators to map routes (URLs) to handler functions.
  • Integrating the model with the app is relatively easy if the model can be loaded from disk; see the sketch below.
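
A minimal sketch of such a Flask app, assuming a scikit-learn model saved to disk as model.joblib (file, route, and field names are assumptions):

    import joblib
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Load the trained model from disk once, at startup
    model = joblib.load("model.joblib")

    # The decorator maps the /predict route to this handler function
    @app.route("/predict", methods=["POST"])
    def predict():
        # Read the JSON payload sent by the client
        payload = request.get_json()
        # scikit-learn expects a 2-D array: one row per sample
        prediction = model.predict([payload["features"]])
        return jsonify({"prediction": prediction.tolist()})

    if __name__ == "__main__":
        app.run(port=5000)

In production, Gunicorn serves the same app object (gunicorn app:app) instead of Flask's built-in development server.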