
Beta: VoltSP YAML Pipeline Definition Language

VoltSP provides a declarative YAML configuration language for defining streaming data pipelines without writing Java code. This document describes the structure and options available in the YAML configuration format.

Basic Structure

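A VoltSP pipeline configuration is a YAML document with the following top-level sections. The version, name, source, and sink sections are required; pipeline and logging are optional: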

version: 1              # Required: Configuration version (must be 1)
name: "pipeline-name"   # Required: Pipeline name
source: { }             # Required: Source configuration
pipeline: { }           # Optional: Processing steps to apply
sink: { }               # Required: Sink configuration
logging: { }            # Optional: Logging configuration
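
As a minimal sketch of a configuration that contains only the required sections, the fragment below reads from standard input and writes to a file sink. It reuses the stdin source and the file sink with dirPath that appear in the examples in this document; the pipeline name is made up for illustration, and leaving out pipeline and logging assumes their defaults are acceptable.

```yaml
version: 1
name: "minimal-pipeline"   # hypothetical name

source:
    stdin: { }

sink:
    file:
        dirPath: "/tmp"    # value taken from the file-processor example
```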

Configuration Sections

Version

Must be 1. This field is required.

version: 1

Name

The pipeline name, which appears in the logs and in metrics. This field is required.

name: "my-pipeline"

Source Configuration

The source section defines where the pipeline gets its data. You must specify exactly one source. All sources available to the Java DSL are supported.

Each source type has its own configuration parameters.

Pipeline Configuration

The pipeline section configures processing and the data transformations to apply. It accepts two optional keys:

  • parallelism: Optional value specifying pipeline parallelism
  • processors: Optional array of processor configurations
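
A pipeline section that sets both keys might look like the sketch below. The parallelism value of 2 is an arbitrary illustration, and the javascript processor follows the form shown under Processor Types, which is marked as not yet implemented.

```yaml
pipeline:
    parallelism: 2          # illustrative value
    processors:
        -   javascript:
                code: "message.toUpperCase()"
```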

Sink Configuration

The sink section defines where the pipeline outputs its data. You must specify exactly one sink type. All sinks available to the Java DSL are supported.

Logging Configuration

Note: Not yet implemented

The optional logging section configures logging behavior:

logging:
    globalLevel: "DEBUG"        # Global log level
    loggers: # Per-logger configuration
        "org.myapp": "TRACE"
        "org.thirdparty": "WARN"

Source Types

Some examples of simple source configurations follow.

File Source

Reads data from a file:

source:
    file:
        path: "input.txt"       # Required: Path to input file

Stdin Source

Reads data from standard input:

source:
    stdin: { }

Collection Source

Reads from a static collection of strings:

source:
    collection:
        elements: # Required: Array of strings
            - "element1"
            - "element2"

Network Source

Reads data from a network socket (UDP or TCP):

source:
    network:
        type: "UDP"                        # Required: UDP or TCP
        address: "0.0.0.0:12345"           # Required: Port number or address:port
        decoder: "lines"                   # Required: Decoder type (none/identity/line/bytes)
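
For comparison, a TCP variant could look like the following sketch. It assumes that the port-only form of address is written as a quoted string; the comment above allows a bare port number but does not say how it should be written, so treat that detail as a guess.

```yaml
source:
    network:
        type: "TCP"
        address: "12345"        # assumption: port-only form, written as a string
        decoder: "lines"        # same decoder value as the UDP example
```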

Beats Source

Reads from Elastic Beats:

source:
    beats:
        address: "0.0.0.0:514"             # Required: Listen address
        clientInactivityTimeout: "PT30S"   # Optional: Connection idle timeout (ISO 8601 duration; PT30S is 30 seconds)
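
Kafka Source

The Kafka source is not described above, but it is used in the Kafka to VoltDB example later in this document. Pulled out as a stand-alone fragment with the same keys and values as that example, it looks like the sketch below. Which of these keys are required, and which startingOffset values other than LATEST are accepted, is not stated here.

```yaml
source:
    kafka:
        bootstrapServers:            # list of Kafka broker addresses
            - "kafka1:9092"
            - "kafka2:9092"
        topicNames:                  # list of topics to read from
            - "incoming-data"
        groupId: "processor-group"   # Kafka consumer group ID
        startingOffset: "LATEST"
```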

Sink Types

VoltDB Sink

Sends data to VoltDB by invoking a stored procedure:

sink:
    voltdb-procedure:
        procedureName: "MyStoredProc"      # Required: Stored procedure name
        servers: "voltdb-host:21212"       # Required: VoltDB server address (host:port)
        client:
            retries: 3                     # Optional: Number of retries
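
File and Network Sinks

The complete examples later in this document also use a file sink and a network sink. Shown as stand-alone fragments with the values from those examples, they look like the sketches below. Which keys are required, and what other options exist, is not stated here. A configuration would contain only one of them.

```yaml
sink:
    file:
        dirPath: "/tmp"                  # value from the file-processor example
```

```yaml
sink:
    network:
        type: "UDP"
        address: "target-host:54321"     # value from the network-relay example
```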

Processor Types

Note: Not yet implemented

Processors can be written in multiple languages and are defined in the pipeline's processors array. Each processor must specify its language and code:

pipeline:
    processors:
        -   javascript:
                code: "message.toUpperCase()"
        -   python:
                code: |
                    import re
                    def process(message):
                        return message.lower()
                    process(message)
        -   ruby:
                code: |
                    message.reverse

Complete Examples

Simple File Processing Pipeline

version: 1
name: "file-processor"

source:
    file:
        path: "input.txt"

pipeline:
    parallelism: 1
    processors:
        -   javascript:
                code: |
                    message.toUpperCase();

sink:
    file:
        dirPath: "/tmp"

Kafka to VoltDB Pipeline

version: 1
name: "kafka-to-voltdb"

source:
    kafka:
        bootstrapServers:
            - "kafka1:9092"
            - "kafka2:9092"
        topicNames:
            - "incoming-data"
        groupId: "processor-group"
        startingOffset: "LATEST"

pipeline:
    parallelism: 4
    processors:
        -   javascript:
                code: |
                    // Transform message
                    JSON.parse(message)

sink:
    voltdb-procedure:
        procedureName: "ProcessData"
        servers: "voltdb-host:21212"
        client:
            retries: 3

logging:
    globalLevel: "INFO"
    loggers:
        org.voltdb: "DEBUG"

Network to Network Pipeline

version: 1
name: "network-relay"

source:
    network:
        type: "UDP"
        address: "0.0.0.0:12345"
        decoder: "lines"

pipeline:
    parallelism: 1

sink:
    network:
        type: "UDP"
        address: "target-host:54321"