Chapter 1. Streaming 101

The reasons we need a streaming data processing:

  1. Businesses crave timely insights into their data
  2. The massive, unbounded datasets that are increasingly common
  3. Processing data as they arrive spreads workloads out more evenly over time

Terminology: What Is Streaming?

Streaming system: A type of data processing engine that is designed with infinite dataset in mind
Bounded data: A type of dataset that is finite in size
Unbounded data: A type of dataset that is infinite in size (at least theoretically)
Table: A holistic view of dataset at a specific point in time. SQL systems have traditionally dealt in tables
Stream: An element-by-element view of the evolution of a dataset over time. The MapReduce lineage of data processing systems have traditionally dealt in streams.

Mentions:

  1. Streaming doesn’t only mean low latency, inaccurate, or speculative results
  2. Batch system doesn’t always meaning eventually correct results too.

Both statements may be from traditional Lambda Architecture. Using a replayable system like Kafka (Kappa architecture) could help. In Google, Cloud Dataflow provides both batch and streaming runners under the same unified model. Steaming systems combined with robust frameworks for unbounded data processing will in time allow for the relegation of the Lambda Architecture.

Correctness is important. Streaming systems need a method for checkpointing persistent state over time.

Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew.

  • Event time: This is the time at which events actually occurred
  • Processing time: This is the time at which events are observed in the system.

In real world, things that can affect the level of skew between event time and processing include the following:

  1. Shared resource limitation, like network congestion, network partitions, or shared CPU in a nondedicated environment
  2. Software causes such as distributed system logic, contention, and so on
  3. Features of the data themselves, like key distribution, variance in throughput or variance in disorder.

The overall mapping(lag/skew) between event time and process time is not static. The way many systems designed for unbounded data have historically operated. These systems typically provide some notion of windowing the incoming data. However, disorder and variable skew induce a completeness problem for event-time windows. We should design tools that allow us to live in the world of uncertainty imposed by these complex datasets.

Unbounded Data: Batch

Such approaches revolve around slicing up the unbounded data into a collection of bounded datasets appropriate for batch processing.

  1. Fix windows: window the input data into fixed-size windows (completeness issue)
  2. Sessions: defined as periods of activity (e.g., for a specific user) terminated by a gap of inactivity.

Unbounded Data: Streaming

For many real-world, distributed input sources, you will find:

  1. Highly unordered with respect to event times – need to sort by time
  2. Of varying event-time skew

Generally categorize these approaches into four groups: time-agnostic, approximation, windowing by processing time, and windowing by event time.

  1. Time-agnostic: Time-agnostic processing is used for cases in which time is irrelevant. More data is only important thing. Examples:
    • Filter
    • Inner joins: if you care only about the results of a join when an element from both sources arrive, there’s no temporal element to the logic.
    • Approximation algorithms: these algorithms usually process elements as they arrive like approximate Top-N
  2. Windowing: similar fix/sliding/session-based windowing.
    • Windowing by process time – Simple, straightforward but missing event-time order
    • Windowing by event time – Powerful not for free
      • Buffering: extended window lifetime, more buffering of data is required
      • Completeness: no good way of knowing when we’ve seen all of the data for a given windows – need heuristic estimate of window completion via something like the watermarks.

Leave a Reply

Your email address will not be published. Required fields are marked *