Introduction

  • Stream processing is a way of architecting a system of computers that exploits parallelism. This is achieved by reducing the synchronization required between interacting components.
  • From our limited viewpoint (i.e., from an ML deployment perspective), one should no longer think of data in such systems as residing primarily on disks/storage.
  • Instead, we can think of data as always moving across parts of the cloud/system. At certain junctions, it is transformed or logged.
  • This way of architecting systems comes with benefits and challenges:
    • Benefits include reduced latency as well as higher interoperability:
      • Reduced latency is critical for many digital and cloud-native industries (e.g., Internet companies, finance).
      • Different services can use different programming environments/languages/runtimes and communicate in a standard way with each other.
    • Challenges include building fault tolerance and buffering into the software.

Example Streaming Platforms

  • We will use Apache Kafka as an example streaming platform (this is a self-hosted solution) to illustrate some of the key ideas here.
  • Other solutions include:
    • Pub/Sub by GCP and Kinesis by AWS (these are managed solutions).
    • Apache Flume, Apache Apex, Apache Storm (these are self-hosted solutions).

What is the goal of this paradigm?

  • Streaming solutions allow for real-time data science pipelines.
  • The goal is to work on data as it is generated by an upstream component.
  • For instance, if users are generating data, it can be consumed as soon as it is generated, without waiting for it to be written to disk or a data lake (this can be done eventually).
  • Another example is real-time recommendations: as the user interacts with an app, requests are made to endpoints, which immediately trigger various services.
  • The speed at which different services/cloud components produce and consume data is a defining characteristic of a streaming workflow.
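
The idea of acting on data as it is generated, with storage happening afterwards, can be sketched in plain Python. This is only an illustration under stated assumptions: `user_events`, `consume`, and the in-memory `archive` list are hypothetical stand-ins for an upstream component, a consuming service, and a data lake, and no streaming platform is involved.

```python
from typing import Dict, Iterable, Iterator, List


def user_events() -> Iterator[Dict]:
    """Hypothetical upstream component: yields click events one at a time."""
    for i, item in enumerate(["shoes", "hat", "shoes", "bag"]):
        yield {"event_id": i, "user": "u1", "item": item}


def consume(events: Iterable[Dict], archive: List[Dict]) -> Iterator[Dict]:
    """Act on each event as soon as it arrives; archive it afterwards."""
    counts: Dict[str, int] = {}
    for event in events:
        counts[event["item"]] = counts.get(event["item"], 0) + 1
        # Emit an up-to-date result immediately, before any storage happens.
        yield {
            "event_id": event["event_id"],
            "top_item": max(counts, key=counts.get),
        }
        # Stand-in for the eventual write to disk or a data lake.
        archive.append(event)


archive: List[Dict] = []
results = list(consume(user_events(), archive))
for r in results:
    print(r)
```

Each result is available after a single event has been seen, rather than after the whole batch has been collected and stored.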

Real World Examples

  • NYT, Pinterest, Adidas, Airbnb, Coursera, Cisco, Etsy, Spotify, Twitter, … use Apache Kafka in various forms. See https://kafka.apache.org/powered-by.
  • Zillow uses an equivalent stream processing solution, Kinesis (source):

Zillow uses Kinesis Data Streams to collect public record data and MLS listings, and then update home value estimates in near real-time so home buyers and sellers can get the most up to date home value estimates. Zillow also sends the same data to its Amazon S3 data lake using Kinesis Data Firehose, so that all the applications can work with the most recent information.

  • Netflix also uses an equivalent stream processing solution, Kinesis (source):

Netflix uses Amazon Kinesis to monitor the communications between all of its applications so it can detect and fix issues quickly, ensuring high service uptime and availability to its customers.

Relationship to ML Deployment

  • Terminology:
    • Different services in a streaming environment pass messages (e.g., data) to each other.
    • Producers produce messages and consumers consume messages.
  • ML models can transform incoming data (they are essentially functions) and send the outputs to downstream consumers.
    • The downstream consumers could be using multiple such predictions (perhaps even across time) to make decisions.
    • For example, let's take video conferencing (Webex, Zoom, Google Meet, and numerous others). An ML service could predict the internet connection quality of your phone/laptop, and a downstream service could switch the video stream quality based on this.
  • The challenge is that the ML service has to scale when the number of prediction requests is very high!
  • This scaling can only be achieved in a cluster-like setup. A single server/container deployment would be inadequate unless load balancing and other techniques are used.
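
The producer/consumer terminology and the video conferencing example can be sketched with Python's standard library alone. This is a toy sketch, not a streaming platform: in-process queues stand in for topics, the message fields (`bandwidth_mbps`, `latency_ms`), the scoring rule in `predict_quality`, and the window size are all made-up assumptions, and the "model" is a hand-written rule rather than a trained one.

```python
import queue
import threading
from collections import deque

SENTINEL = None  # marks the end of the stream in this toy example


def producer(out_q, readings):
    """Producer: emits raw network measurements as messages."""
    for reading in readings:
        out_q.put(reading)
    out_q.put(SENTINEL)


def predict_quality(msg):
    """Placeholder 'model': a hand-written rule, not a trained model."""
    score = msg["bandwidth_mbps"] / (1.0 + msg["latency_ms"] / 100.0)
    return "good" if score >= 2.0 else "poor"


def ml_service(in_q, out_q):
    """Consumes measurements and produces predictions for downstream use."""
    while (msg := in_q.get()) is not SENTINEL:
        out_q.put({"t": msg["t"], "quality": predict_quality(msg)})
    out_q.put(SENTINEL)


def video_service(in_q, decisions, window=3):
    """Downstream consumer: decides using the last `window` predictions."""
    recent = deque(maxlen=window)
    while (pred := in_q.get()) is not SENTINEL:
        recent.append(pred["quality"])
        poor = sum(q == "poor" for q in recent)
        # Switch to low resolution when most recent predictions are poor.
        decisions.append("low_res" if poor * 2 > len(recent) else "high_res")


readings = [
    {"t": 0, "bandwidth_mbps": 10.0, "latency_ms": 50.0},
    {"t": 1, "bandwidth_mbps": 1.0, "latency_ms": 300.0},
    {"t": 2, "bandwidth_mbps": 1.0, "latency_ms": 400.0},
    {"t": 3, "bandwidth_mbps": 8.0, "latency_ms": 100.0},
    {"t": 4, "bandwidth_mbps": 9.0, "latency_ms": 50.0},
]
raw_q, pred_q = queue.Queue(), queue.Queue()
decisions = []
threads = [
    threading.Thread(target=producer, args=(raw_q, readings)),
    threading.Thread(target=ml_service, args=(raw_q, pred_q)),
    threading.Thread(target=video_service, args=(pred_q, decisions)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(decisions)
```

The three components only share queues, so each could in principle be replaced by a separate process or machine; scaling the ML service to many requests is exactly what this single-process sketch does not address.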

Differences with Serverless Solutions

  • While serverless solutions provide excellent scalability, they are essentially running on single VMs/containers.
  • We saw earlier how Spark can handle volume in the batch setting. With Spark Streaming, we can address the scale issue using the same distributed-processing ideas that helped achieve scale for batch workflows.

Our Goals

  • Set up a single node Apache Kafka streaming platform.
  • Set up a couple of Python processes that send and receive messages. We will rely on PySpark streaming and Databricks.