What is Data Streaming?
Data Streaming refers to the instrumentation and processing of continuous data streams. It is often contrasted with Batching, the practice of moving data sets from storage to storage at triggered intervals. In the streaming paradigm, data is considered "in motion": when a data point is generated in a data source, it is immediately processed and passed on to consumer systems, instead of being collected in a storage service for future processing. Most of today's source data is generated in a streaming fashion: transactions, logs, sensor data, social media feeds, clickstreams… By processing these streams in a Data Streaming paradigm, organizations can gain insights, detect anomalies and trends, and act on the data while it is being generated.
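The difference between the two paradigms can be sketched in a few lines of Python. This is a hypothetical illustration, not tied to any particular streaming platform: the event source and field names are invented for the example.

```python
def event_source():
    """Simulate a continuous source of events (e.g. clickstream entries)."""
    for i in range(5):
        yield {"event_id": i, "value": i * 10}

# Batch paradigm: collect everything into storage first, process later.
batch = list(event_source())
batch_total = sum(e["value"] for e in batch)  # result only exists after collection

# Streaming paradigm: act on each event the moment it arrives.
running_total = 0
for event in event_source():
    running_total += event["value"]  # an up-to-date result exists continuously

print(batch_total, running_total)  # both end at 100, but the streaming
                                   # version had intermediate results all along
```

Both loops arrive at the same final answer; the difference is *when* a usable result exists, which is exactly what the "in motion" framing captures.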
Standard Batch Processing Architecture
Hybrid Processing Architecture
End-to-End Live Processing
Why use a Data Streaming solution?
The core benefits from Data Streaming are:
- Live Processing: Stream processing typically has very low latency. Since data is processed continuously as soon as it arrives, there is minimal delay in obtaining the processed results. It becomes possible to analyze and act upon data while it is being generated instead of after the fact.
- Business Value: as a consequence of timeliness, Data Streaming can target use cases not accessible to other systems, for instance replicating data sets in real time, customizing user sessions while they are live, preventing defects before they become costly, or scoring the risk of a transaction on the fly.
- Resource Efficiency: with stream processing, only the difference between the existing and incoming data is computed, rather than reprocessing the entire dataset. Data points can easily be added, modified, or deleted, and the results are updated immediately.
- Scalability: Stream processing systems are highly scalable thanks to their incremental nature. They are designed to handle large and rapidly changing volumes of data, allowing organizations to scale gracefully as data needs evolve.
- Fault tolerance: Stream processing systems are built with fault tolerance in mind and can recover from failure at specific points in time, making them well suited to both analytical and operational workloads.
- Simplified data operations: Since streams are processed continuously, Data Streaming eliminates the need for a data orchestration service, making data operations less complex and more agile.
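The resource-efficiency point above can be illustrated with a small incrementally maintained aggregate. This is a hypothetical sketch of the general technique, where each new event updates the result in constant time rather than triggering a recomputation over the full dataset:

```python
class RunningAverage:
    """Incrementally maintained aggregate: each new event updates the
    result in O(1), instead of re-reading the whole dataset."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        # Only the delta introduced by the new event is processed.
        self.count += 1
        self.total += value

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0

avg = RunningAverage()
for reading in [10, 20, 30]:   # events arriving one by one
    avg.add(reading)
print(avg.value)  # 20.0
```

Streaming engines apply the same idea at scale: state is kept alongside the stream so each event touches only the state it affects, which is also what makes these systems scale gracefully.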
The world of continuous data comes with some specific vocabulary; here are some examples:
- An Event: A discrete data point. An event can be anything from a user action to a sensor reading.
- Event-Driven Architecture: An architectural paradigm that makes use of events for communication between components. Event-Driven Architectures focus on agile and scalable systems by decoupling applications into smaller, more modular components and treating data like a service that can interface with other applications.
- A Stream: A structured and named sequence of events. These Streams are continuous in nature even though the events within them may not be.
- To Publish: the act of sending data into a stream.
- To Consume: the act of receiving data from a stream.
- Producer: the name given to an entity or component that publishes data into a stream.
- Consumer: the name given to an entity or component that consumes data from a stream.
- A Subscription: a programmatic agreement between a producer and a consumer that allows the consumer to consume the content of a stream. When a consumer subscribes to a data stream, it is essentially requesting that the provider send data as it becomes available, rather than periodically polling the source for updates. This allows the consumer to receive near-real-time updates without constantly checking for new data.
- Cursor / Offset: a cursor marks the last event consumed by a subscription. Cursors are very useful for data operations like replaying data, updating a data service on the fly, or reliably recovering from failures.
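The vocabulary above can be tied together in a minimal in-memory sketch. The class and method names here are illustrative only, not from any real streaming API:

```python
class Stream:
    """A named, append-only sequence of events."""

    def __init__(self, name):
        self.name = name
        self.events = []          # the append-only log

    def publish(self, event):     # producer side: send data into the stream
        self.events.append(event)

class Subscription:
    """Tracks a consumer's position (cursor/offset) in a stream."""

    def __init__(self, stream):
        self.stream = stream
        self.offset = 0           # cursor: how far this consumer has read

    def consume(self):            # consumer side: receive only new events
        new_events = self.stream.events[self.offset:]
        self.offset = len(self.stream.events)
        return new_events

clicks = Stream("clicks")
sub = Subscription(clicks)

clicks.publish({"user": "a", "page": "/home"})
clicks.publish({"user": "b", "page": "/pricing"})
print(len(sub.consume()))  # 2 -- both events delivered

clicks.publish({"user": "a", "page": "/docs"})
print(len(sub.consume()))  # 1 -- only the event past the cursor

sub.offset = 0             # rewinding the cursor replays the stream
print(len(sub.consume()))  # 3
```

Real systems add durability, partitioning, and delivery guarantees on top, but the producer/consumer/cursor relationship is the same.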
Some of the Current Solutions
Confluent: Confluent is a technology company that was founded in 2014 by the original creators of Apache Kafka, a popular open-source distributed streaming platform. Confluent provides a comprehensive platform built around Kafka, called Confluent Platform. The Confluent Platform offers various tools, services, and enhancements to Apache Kafka, making it easier for developers and organizations to work with data streams.
Popsink: Popsink is a managed stream processing service. It aims to integrate seamlessly with existing Modern Data Stack solutions. Popsink's focus is on abstracting away the operations on data streams to help users leverage continuous data from existing tools, without going through any migration or retraining.
Materialize: Materialize is an engine that enables the materialization of views on top of streaming data in SQL. The company builds both an Open Source and a Cloud Native offering of their technology. One of the core features of Materialize is its PostgreSQL compatibility, making it very easy to use existing PostgreSQL-compatible tools and services.
RisingWave Labs: RisingWave Labs is the company behind RisingWave. Similar to Materialize, RisingWave is an Open Source distributed SQL streaming engine. Written in Rust and compatible with PostgreSQL, RisingWave is also available as a managed cloud offering.