15.9. The Kafka Export Connector

Documentation

VoltDB Home » Documentation » Using VoltDB

15.9. The Kafka Export Connector

The Kafka connector receives serialized data from the export streams and writes it to a message queue using the Apache Kafka version 0.8.2 protocols. Apache Kafka is a distributed messaging service that lets you set up message queues which are written to and read from by "producers" and "consumers", respectively. In the Apache Kafka model, VoltDB export acts as a "producer".

Before using the Kafka connector, we strongly recommend reading the Kafka documentation and becoming familiar with the software, since you will need to set up a Kafka 0.8.2 service and appropriate "consumer" clients to make use of VoltDB's Kafka export functionality. The instructions in this section assume a working knowledge of Kafka and the Kafka operational model.

When the Kafka connector receives data from the VoltDB export streams, it establishes a connection to the Kafka messaging service as a Kafka producer. It then writes records to Kafka topics based on the VoltDB stream name and certain export connector properties.

The majority of the Kafka export properties are identical in both in name and content to the Kafka producer properties listed in the Kafka documentation. All but one of these properties are optional for the Kafka connector and will use the standard Kafka default value. For example, if you do not specify the queue.buffering.max.ms property it defaults to 5000 milliseconds.

The only required property is bootstrap.servers, which lists the Kafka servers that the VoltDB export connector should connect to. You must include this property so VoltDB knows where to send the export data. Specify each server by its IP address (or hostname) and port; for example, myserver:7777. If there are multiple servers in the list, separate them with commas.

In addition to the standard Kafka producer properties, there are several custom properties specific to VoltDB. The properties binaryencoding, skipinternals, and timezone affect the format of the data. The topic.prefix and topic.key properties affect how the data is written to Kafka.

The topic.prefix property specifies the text that precedes the stream name when constructing the Kafka topic. If you do not specify a prefix, it defaults to "voltdbexport". Alternately, you can map individual streams to topics using the topic.key property. In the topic.key property you associate a VoltDB export stream name with the corresponding Kafka topic as a named pair separated by a period (.). Multiple named pairs are separated by commas (,). For example:

Employee.EmpTopic,Company.CoTopic,Enterprise.EntTopic

Any stream-specific mappings in the topic.key property override the automated topic name specified by topic.prefix.

Note that unless you configure the Kafka brokers with the auto.create.topics.enable property set to true, you must create the topics for every export stream manually before starting the export process. Enabling auto-creation of topics when setting up the Kafka brokers is recommended.

When configuring the Kafka export connector, it is important to understand the relationship between synchronous versus asynchronous processing and its effect on database latency. If the export data is sent asynchronously, the impact of export on the database is reduced, since the export connector does not wait for the Kafka infrastructure to respond. However, with asynchronous processing, VoltDB is not able to resend the data if the message fails after it is sent.

If export to Kafka is done synchronously, the export connector waits for acknowledgement of each message sent to Kafka before processing the next packet. This allows the connector to resend any packets that fail. The drawback to synchronous processing is that on a heavily loaded database, the latency it introduces means export may not be able to keep up with the influx of export data and and have to write to overflow.

You specify the level of synchronicity and durability of the connection using the Kafka acks property. Set acks to "0" for asynchronous processing, "1" for synchronous delivery to the Kafka broker, or "all" to ensure durability on the Kafka broker. Use of "all" is not recommended for VoltDB export. See the Kafka documentation for more information.

VoltDB guarantees that at least one copy of all export data is sent by the export connector. But when operating in asynchronous mode, the Kafka connector cannot guarantee that the packet is actually received and accepted by the Kafka broker. By operating in synchronous mode, VoltDB can catch errors returned by the Kafka broker and resend any failed packets. However, you pay the penalty of additional latency and possible export overflow.

Finally, the actual export data is sent to Kafka as a comma-separated values (CSV) formatted string. The message includes six columns of metadata (such as the transaction ID and timestamp) followed by the column values of the export stream.

Table 15.4, “Kafka Export Properties” lists the supported properties for the Kafka connector, including the standard Kafka producer properties and the VoltDB unique properties.

Table 15.4. Kafka Export Properties

PropertyAllowable ValuesDescription
bootstrap.servers*string

A comma-separated list of Kafka brokers (specified as IP-address:port-number). You can use metadata.broker.list as a synonym for bootstrap.servers.

acks0, 1, allSpecifies whether export is synchronous (1 or all) or asynchronous (0) and to what extent it ensures delivery. See the Kafka documentation of the producer properties for details.
acks.retry.timeoutintegerSpecifies how long, in milliseconds, the connector will wait for acknowledgment from Kafka for each packet. The retry timeout only applies if acknowledgements are enabled. That is, if the acks property is set greater than zero. The default timeout is 5,000 milliseconds. When the timeout is reached, the connector will resend the packet of messages.
partition.key{stream}.{column}[,...]

Specifies which stream column value to use as the Kafka partitioning key for each stream. Kafka uses the partition key to distribute messages across multiple servers.

By default, the value of the stream's partitioning column is used as the Kafka partition key. Using this property you can specify a list of stream column names, where the stream name and column name are separated by a period and the list of stream references is separated by commas. If the stream is not partitioned and you do not specify a key, the server partition ID is used as a default.

binaryencodinghex, base64Specifies whether VARBINARY data is encoded in hexadecimal or BASE64 format. The default is hexadecimal.
skipinternalstrue, falseSpecifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported stream data. The default is false.
timezonestringThe time zone to use when formatting the timestamp. Specify the time zone as a Java timezone identifier. The default is GMT.
topic.keystring

A set of named pairs each identifying a VoltDB stream name and the corresponding Kafka topic name to which the stream is written. Separate the names with a period (.) and the name pairs with a comma (,).

The specific stream/topic mappings declared by topic.key override the automated topic names specified by topic.prefix.

topic.prefixstringThe prefix to use when constructing the topic name. Each row is sent to a topic identified by {prefix}{stream-name}. The default prefix is "voltdbexport".
Kafka producer propertiesvarious

You can specify standard Kafka producer properties as export connector properties and their values will be passed to the Kafka interface. However, you cannot modify the property block.on.buffer.full.

*Required