Beta: VoltSP YAML Pipeline Definition Language

VoltSP provides a declarative YAML configuration language for defining streaming data pipelines without writing Java code. This document describes the structure and options available in the YAML configuration format.

Basic Structure

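A VoltSP pipeline configuration is made up of the following top-level sections. The version, name, source, and sink sections are required; pipeline and logging are optional: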

version: 1              # Required: Configuration version (must be 1)
name: "pipeline-name"   # Required: Pipeline name
source: { }             # Required: Source configuration
pipeline: { }           # Optional: Processing steps to apply
sink: { }               # Required: Sink configuration
logging: { }            # Optional: Logging configuration
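
As a minimal sketch, a configuration that uses only the required sections might look like the following. The pipeline name and output directory are placeholder values; the stdin source and file sink are the ones that appear elsewhere in this document.

```yaml
version: 1
name: "stdin-to-file"     # placeholder name, shown in logs and metrics

source:
    stdin: { }            # read records from standard input

sink:
    file:
        dirPath: "/tmp"   # placeholder output directory
```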

Configuration Sections

Version

Must be 1. This field is required.

version: 1

Name

The pipeline name, which appears in both logs and metrics. This field is required.

name: "my-pipeline"

Source Configuration

The source section defines where the pipeline gets its data. You must specify exactly one source. All sources available to the Java DSL are supported.

Each source type has its own configuration parameters.

Pipeline Configuration

The pipeline section configures how data is processed and which transformations are applied to it. It accepts:

parallelism: Optional value specifying pipeline parallelism
processors: Optional array of processor configurations
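
For illustration, a pipeline section that sets both options could be sketched as follows. The parallelism value is arbitrary, and the processor entry simply mirrors the javascript processor syntax used elsewhere in this document.

```yaml
pipeline:
    parallelism: 2                            # arbitrary example value
    processors:
        -   javascript:
                code: "message.toUpperCase()"
```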

Sink Configuration

The sink section defines where the pipeline outputs its data. You must specify exactly one sink type. All sinks available to the Java DSL are supported.
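
In practice this means the sink section holds a single key naming the sink type, with that type's parameters nested beneath it. A sketch using the file sink that appears in this document's examples (the directory is a placeholder):

```yaml
sink:
    file:
        dirPath: "/tmp"   # placeholder directory
    # Adding a second sink type here (for example voltdb-procedure)
    # would break the "exactly one sink type" rule.
```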

Logging Configuration

Note: Not yet implemented

The optional logging section configures logging behavior:

logging:
    globalLevel: "DEBUG"        # Global log level
    loggers: # Per-logger configuration
        "org.myapp": "TRACE"
        "org.thirdparty": "WARN"

Source Types

The following are examples of simple source configurations.

File Source

Reads data from a file:

source:
    file:
        path: "input.txt"       # Required: Path to input file

Stdin Source

Reads data from standard input:

source:
    stdin: { }

Collection Source

Reads from a static collection of strings:

source:
    collection:
        elements: # Required: Array of strings
            - "element1"
            - "element2"

Network Source

Reads data from a network socket over UDP or TCP:

source:
    network:
        type: "UDP"                        # Required: UDP or TCP
        address: "0.0.0.0:12345"           # Required: Port number or address:port
        decoder: "lines"                   # Required: Decoder type (none/identity/line/bytes)
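
The comments above indicate that type may also be TCP and that address accepts a bare port number. A hedged sketch of that variant follows; the port is arbitrary, and writing the port as a quoted string is an assumption rather than confirmed syntax.

```yaml
source:
    network:
        type: "TCP"
        address: "12345"     # port only (arbitrary value); quoting is an assumption
        decoder: "lines"     # same decoder value as in the UDP example
```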

Beats Source

Reads from Elastic Beats:

source:
    beats:
        address: "0.0.0.0:514"             # Required: Listen address
        clientInactivityTimeout: "PT30S"   # Optional: Connection idle timeout (ISO8601 duration)
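
The clientInactivityTimeout value is an ISO 8601 duration: PT introduces a time-based duration, followed by hours (H), minutes (M), and seconds (S). As a sketch, a two-minute idle timeout could be written like this (the timeout itself is an arbitrary example value):

```yaml
source:
    beats:
        address: "0.0.0.0:514"
        clientInactivityTimeout: "PT2M"    # 2 minutes; "PT1M30S" would be 90 seconds
```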

Sink Types

VoltDB Sink

Writes data to VoltDB by calling a stored procedure:

sink:
    voltdb-procedure:
        procedureName: "MyStoredProc"      # Required: Stored procedure name
        servers: "voltdb-host:21212"       # Required: VoltDB host
        client:
            retries: 3                     # Optional: Number of retries

Processor Types

Note: Not yet implemented

Processors can be written in multiple languages and are defined in the pipeline's processors array. Each processor must specify its language and code:

pipeline:
    processors:
        -   javascript:
                code: "message.toUpperCase()"
        -   python:
                code: |
                    import re
                    def process(message):
                        return message.lower()
                    process(message)
        -   ruby:
                code: |
                    message.reverse

Complete Examples

Simple File Processing Pipeline

version: 1
name: "file-processor"

source:
    file:
        path: "input.txt"

pipeline:
    parallelism: 1
    processors:
        -   javascript:
                code: |
                    message.toUpperCase();

sink:
    file:
        dirPath: "/tmp"

Kafka to VoltDB Pipeline

version: 1
name: "kafka-to-voltdb"

source:
    kafka:
        bootstrapServers:
            - "kafka1:9092"
            - "kafka2:9092"
        topicNames:
            - "incoming-data"
        groupId: "processor-group"
        startingOffset: "LATEST"

pipeline:
    parallelism: 4
    processors:
        -   javascript:
                code: |
                    // Transform message
                    JSON.parse(message)

sink:
    voltdb-procedure:
        procedureName: "ProcessData"
        servers: "voltdb-host:21212"
        client:
            retries: 3

logging:
    globalLevel: "INFO"
    loggers:
        org.voltdb: "DEBUG"

Network to Network Pipeline

version: 1
name: "network-relay"

source:
    network:
        type: "UDP"
        address: "0.0.0.0:12345"
        decoder: "lines"

pipeline:
    parallelism: 1

sink:
    network:
        type: "UDP"
        address: "target-host:54321"
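
The same building blocks can be combined for quick local experiments. The sketch below is illustrative rather than taken from a tested deployment: it feeds a static collection into the file sink, and the pipeline name, elements, and directory are all placeholders.

```yaml
version: 1
name: "collection-to-file"      # placeholder name

source:
    collection:
        elements:               # placeholder records
            - "first record"
            - "second record"

sink:
    file:
        dirPath: "/tmp"         # placeholder output directory
```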