How VoltSP differs from the traditional stream processing engines¶
VoltSP is a stream processing framework that takes a different approach from traditional streaming engines. Understanding this difference is key to getting the most out of VoltSP.
Characteristics of Traditional Streaming Engines¶
Traditional stream processing engines are typically designed around stateful workers. They store state on disk or in the memory of individual workers, and they often share state, exchange data, and shuffle records between workers. This architecture has several characteristics:
- State management: Workers checkpoint their state, manage recovery, and handle rebalancing when workers join or leave the cluster.
- Failure recovery: When a worker fails, the system restores state from checkpoints and reprocesses data to ensure consistency.
- Deployment considerations: Rolling out updates, scaling, or migrating requires orchestration to preserve state consistency.
The VoltSP Approach: Stateless Workers, No Coordination¶
VoltSP takes a different approach: workers are stateless and do not coordinate with each other. This design simplifies deployment, operation, and maintenance. There are no checkpoints to manage within workers, no state rebalancing during scaling, and no complex recovery procedures at the worker level.
State can be stored in VoltDB instead. VoltSP supports and extends the delivery guarantees provided by the source. If the source supports commits and at-least-once delivery semantics, VoltSP extends these guarantees to the entire pipeline — it only commits the source after data has successfully reached the sink or the dead-letter queue (DLQ). This means that if the source provides replayability, VoltSP can guarantee at-least-once delivery for the whole pipeline. For sources that do not offer such guarantees, data loss may occur in case of failures, so it's important to choose appropriate sources based on your reliability requirements.
The core idea is: when you need state or coordination, you delegate that responsibility to VoltDB — an in-memory distributed SQL database that handles these operations. This separation of concerns keeps streaming workers simple while state management is handled by the database layer.
This approach requires a different way of thinking compared to traditional frameworks, but it can lead to simpler architectures for many use cases.
Processing Without Coordination¶
In many streaming scenarios, coordination between workers is either unnecessary or only minimally required. Consider these common use cases:
- Filtering: Selecting records based on conditions
- Enrichment: Adding additional data to records
- Transformation: Converting records from one format to another
- Pre-aggregation: Computing partial aggregates that will be combined later
None of these operations require workers to communicate with each other. Each worker can process its partition of the data stream independently.
Global State¶
When you need shared state across workers, VoltDB can serve as the coordination point. Consider a scenario where you need to track a global maximum value while processing data in parallel across multiple workers.
In VoltSP, the approach works as follows:
- Each worker computes its local maximum from the records it processes.
- Workers call a VoltDB stored procedure to update the global state.
- The stored procedure atomically compares the incoming value with the current global maximum and updates it if necessary.
This pattern provides a consistent global view of the maximum value stored in VoltDB, without direct coordination between workers. The stored procedure's atomicity guarantees correctness.
Once you have the global state in VoltDB, you have options for how to use it:
- Use it as a pipeline sink: The global aggregate will be continuously updated as new data arrives, and external systems can query the current value from VoltDB whenever they need it.
- Use an emitter: Retrieve the updated global value and continue processing it in the pipeline — for example, transform it, enrich it with additional context, and send it to another sink.
Aggregations¶
If you are only interested in the final aggregation result rather than the raw
data, you can use a STREAM instead of a TABLE in VoltDB. A STREAM is a
virtual table that is not persisted to disk but can act as a source for a VIEW.
This allows you to use a VIEW to compute aggregations incrementally, avoiding
the overhead of manual SELECT calculations. When you insert data into a
STREAM, the associated view is automatically updated, and you can immediately
retrieve the current aggregation state within the same stored procedure.
Partitioning¶
If your application requires aggregations or processing across different keys or multiple partitioning levels, VoltSP handles this through the use of partitioned stored procedures. Each procedure is executed on a specific partition determined by a partition key.
For complex logic involving multiple levels of partitioning, you can chain procedures together. The first procedure processes data in its partition and returns a result that can then be used as the partition key for a subsequent procedure call. This sequential execution allows for sophisticated multi-stage partitioning while maintaining high performance and data locality for each individual step.
Global Windowing¶
When you need to compute aggregates over a global window, VoltSP provides the Volt Window feature. This is useful when you want to aggregate data from all workers into a single window result.
Here's how it works:
- Each worker sends its local data to a global window managed by VoltDB.
- VoltDB collects contributions from all workers and maintains the window state.
- When the window closes (based on the window type configuration), VoltDB computes the final aggregate and emits the result further down the pipeline, triggered according to the emitter API configuration.
This architecture allows multiple workers to contribute data in parallel while the final windowed result is computed in a single location. The coordination of window boundaries is handled by VoltDB.