15.3. VoltDB Export Connectors

You use the EXPORT TO TARGET or MIGRATE TO TARGET clauses to identify the sources of export and start queuing export data. To enable the actual transmission of data to an export target at runtime, you include the <export> and <configuration> tags in the configuration file. You can configure the export targets when you initialize the database root directory, or you can add or modify the export configuration while the database is running using the voltadmin update command.

In the configuration file, the export and configuration tags specify the target you are configuring and which export connector to use (with the type attribute). To export to multiple destinations, you include multiple <configuration> tags, each specifying the target it is configuring. For example:

<export>
   <configuration enabled="true" type="file" target="log">
     . . .
   </configuration>
   <configuration enabled="true" type="jdbc" target="archive">
     . . .
   </configuration>
</export>

You configure each export connector by specifying properties as one or more <property> tags within the <configuration> tag. For example, the following XML code enables export to comma-separated (CSV) text files using the file prefix "MyExport".

<export>
   <configuration enabled="true" target="log" type="file">
     <property name="type">csv</property>
     <property name="nonce">MyExport</property>
  </configuration>
</export>

The properties that are allowed and/or required depend on the export connector you select. VoltDB comes with five export connectors:

  • File (see Section 15.3.2, “The File Export Connector”)

  • HTTP, including Hadoop via WebHDFS (see Section 15.3.3, “The HTTP Export Connector”)

  • JDBC (see Section 15.3.4, “The JDBC Export Connector”)

  • Kafka (see Section 15.3.5, “The Kafka Export Connector”)

  • Elasticsearch (see Section 15.3.6, “The Elasticsearch Export Connector”)

In addition to the connectors shipped as part of the VoltDB software kit, an export connector for Amazon Kinesis is available from the VoltDB public GitHub repository (https://github.com/VoltDB/export-kinesis).

15.3.1. How Export Works

Two important points about export to keep in mind are:

  • Data is queued for export as soon as you declare a stream or table with the EXPORT TO TARGET clause and write to it, even if the export target has not been configured yet. Be careful not to declare export sources and then forget to configure their targets, or else the export queues could grow and cause disk space issues. Similarly, when you drop the stream or table, its export queue is deleted, even if there is data waiting to be delivered to the configured export target.

  • VoltDB will send at least one copy of every export record to the target. It is possible, when recovering command logs or rejoining nodes, that certain export records are resent. It is up to the downstream target to handle these duplicate records, for example, by using unique indexes or including a unique record ID in the export stream.

All nodes in a cluster queue export data, but only one actually writes to the external target. If one or more nodes fail, responsibility for writing to the export targets is transferred to another currently active server. It is possible for gaps to appear in the export queues while servers are offline. Normally if a gap is found, it is not a problem because another node can take over responsibility for writing (and queuing) export data.

However, in unusual cases where export falls behind and nodes fail and rejoin consecutively, it is possible for gaps to occur in all the available queues. When this happens, VoltDB issues a warning to the console (and via SNMP) and waits for the missing data to be resolved. You can also use the @Statistics system procedure with the EXPORT selector to determine exactly what records are and are not present in the queues. If the gap cannot be resolved (usually by rejoining a failed server), you must use the voltadmin export release command to free the queue and resume export at the next available record.

15.3.1.1. Export Overflow

VoltDB uses persistent files on disk to queue export data waiting to be written to its specified target. If for any reason the export target cannot keep up with the connector, VoltDB writes the excess data in the export buffer from memory to disk. This protects your database in several ways:

  • If the destination target is not configured, is unreachable, or cannot keep up with the data flow, writing to disk helps VoltDB avoid consuming too much memory while waiting for the destination to accept the data.

  • If the database stops, the export data is retained across sessions. When the database restarts, the connector will retrieve the overflow data and reinsert it in the export queue.

Even when the target does keep up with the flow, some amount of data is written to the overflow directory to ensure durability across database sessions. You can specify where VoltDB writes the overflow export data using the <exportoverflow> element in the configuration file. For example:

<paths>
   <exportoverflow path="/tmp/export/"/>
</paths>

If you do not specify a path for export overflow, VoltDB creates a subfolder in the database root directory. See Section 3.7.2, “Configuring Paths for Runtime Features” for more information about configuring paths in the configuration file.

15.3.1.2. Persistence Across Database Sessions

It is important to note that VoltDB only uses the disk storage for overflow data. However, you can force VoltDB to write all queued export data to disk using any of the following methods:

  • Calling the @Quiesce system procedure

  • Requesting a blocking snapshot (using voltadmin save --blocking)

  • Performing an orderly shutdown (using voltadmin shutdown)

This means that if you perform an orderly shutdown with the voltadmin shutdown command, you can recover the database — and any pending export queue data — by simply restarting the database cluster in the same root directories.

Note that when you initialize or re-initialize a root directory, any subdirectories of the root are purged.[5] So if your configuration did not specify a different location for the export overflow, and you re-initialize the root directories and then restore the database from a snapshot, the database is restored but the export overflow will be lost. If both your original and new configuration use the same, explicit directory outside the root directory for export overflow, you can start a new database and restore a snapshot without losing the overflow data.

15.3.2. The File Export Connector

The file connector receives the serialized data from the export source and writes it out as text files (either comma or tab separated) to disk. The file connector writes the data out one file per source table or stream, "rolling" over to new files periodically. The filenames of the exported data are constructed from:

  • A unique prefix (specified with the nonce property)

  • A unique value identifying the current version of the database schema

  • The stream or table name

  • A timestamp identifying when the file was started

  • Optionally, the ID of the host server writing the file

While the file is being written, the file name also contains the prefix "active-". Once the file is complete and a new file started, the "active-" prefix is removed. Therefore, any export files without the prefix are complete and can be copied, moved, deleted, or post-processed as desired.

Only one property must be set when using the file connector: the nonce property specifies a unique prefix to identify all files that the connector writes out for this database instance. All other properties are optional and have default values.
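
For example, a configuration that uses several of the optional properties might look like the following sketch. The target name, output directory, and property values here are purely illustrative; only the nonce is required:

<export>
   <configuration enabled="true" target="log" type="file">
     <!-- illustrative values; only nonce is required -->
     <property name="nonce">MyExport</property>
     <property name="type">csv</property>
     <property name="outdir">/tmp/voltdb/export</property>
     <property name="period">30</property>
     <property name="retention">7d</property>
     <property name="uniquenames">true</property>
  </configuration>
</export>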

Table 15.1, “File Export Properties” describes the supported properties for the file connector.

Table 15.1. File Export Properties

nonce* (string)
  A unique prefix for the output files.

type (csv, tsv)
  Specifies whether to create comma-separated (CSV) or tab-delimited (TSV) files. CSV is the default format.

outdir (directory path)
  The directory where the files are created. Relative paths are relative to the database root directory on each server. If you do not specify an output path, VoltDB writes the output files into a subfolder of the root directory itself.

period (integer)
  The frequency, in minutes, for "rolling" the output file. The default frequency is 60 minutes.

retention (string)
  Specifies how long exported files are retained. You specify the retention period as an integer number and a time unit identifier from the following list:

    • s — Seconds
    • m — Minutes
    • h — Hours
    • d — Days

  For example, "31d" sets the retention limit at 31 days. After files exceed the specified time limit, they are deleted by the export subsystem. The default is to retain all files indefinitely.

binaryencoding (hex, base64)
  Specifies whether VARBINARY data is encoded in hexadecimal or BASE64 format. The default is hexadecimal.

charset (string)
  Specifies the character set encoding to use when writing VARCHAR columns to the output stream. The default character encoding is UTF-8.

dateformat (format string)
  The format of the date used when constructing the output file names. You specify the date format as a Java SimpleDateFormat string. The default format is "yyyyMMddHHmmss".

timezone (string)
  The time zone to use when formatting the timestamp. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is GMT.

delimiters (string)
  Specifies the delimiter characters for CSV output. The text string specifies four characters in the following order: the separator, the quote character, the escape character, and the end-of-line character.

  Non-printing characters must be encoded as Java literals. For example, the newline character should be entered as "\n". Alternately, you can use Java Unicode literals, such as "\u000d". You must also encode any XML special characters, such as the ampersand and left angle bracket, as HTML entities for inclusion in the XML configuration file. For example, encode "<" as "&lt;".

  The following property definition matches the default delimiters. That is, the comma, the double quotation character twice (as both the quote and escape delimiters), and the newline character:

    <property name="delimiters">,""\n</property>

batched (true, false)
  Specifies whether to store the output files in subfolders that are "rolled" according to the frequency specified by the period property. The subfolders are named according to the nonce and the timestamp, with "active-" prefixed to the subfolder currently being written.

skipinternals (true, false)
  Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as "true", the output files contain only the exported data.

uniquenames (true, false)
  Specifies whether to include the host ID in the file name to ensure that all files written are unique across a cluster. The export files are always unique per server. But if you plan to write all cluster files to a network drive or copy them to a single location, set this property to true to avoid any possible conflict in the file names. The default is false.

with-schema (true, false)
  Specifies whether to write a JSON representation of the source's schema as part of the export. The JSON schema files can be used to ensure the appropriate datatype and precision is maintained if and when the output files are imported into another system.

*Required


Whatever properties you choose, the order and representation of the content within the output files is the same. The export connector writes a separate line of data for every INSERT it receives, including the following information:

  • Six columns of metadata generated by the export connector.

  • The remaining columns are the columns of the database source, in the same order as they are listed in the database definition (DDL) file.

Table 15.2, “Export Metadata” describes the six columns of metadata generated by the export connector and the meaning of each column.

Table 15.2. Export Metadata

Transaction ID (BIGINT)
  Identifier uniquely identifying the transaction that generated the export record.

Timestamp (TIMESTAMP)
  The time when the export record was generated.

Sequence Number (BIGINT)
  For internal use.

Partition ID (BIGINT)
  Identifies the partition that sent the record to the export target.

Site ID (BIGINT)
  Identifies the site that sent the record to the export target.

Export Operation (TINYINT)
  A single byte value identifying the type of transaction that initiated the export. Possible values include:

    • 1 — insert
    • 2 — delete
    • 3 — update (record before update)
    • 4 — update (record after update)
    • 5 — migration


15.3.3. The HTTP Export Connector

The HTTP connector receives the serialized data from the export streams and writes it out via HTTP requests. The connector is designed to be flexible enough to accommodate most potential targets. For example, the connector can be configured to send out individual records using a GET request or batch multiple records using POST and PUT requests. The connector also contains optimizations to support export to Hadoop via WebHDFS.

15.3.3.1. Understanding HTTP Properties

The HTTP connector is a general purpose export utility that can export to any number of destinations from simple messaging services to more complex REST APIs. The properties work together to create a consistent export process. However, it is important to understand how the properties interact to configure your export correctly. The four key properties you need to consider are:

  • batch.mode — whether data is exported in batches or one record at a time

  • method — the HTTP request method used to transmit the data

  • type — the format of the output

  • endpoint — the target HTTP URL to which export is written

The properties are described in detail in Table 15.3, “HTTP Export Properties”. This section explains the relationship between the properties.

There are essentially two types of HTTP export: batch mode and one record at a time. Batch mode is appropriate for exporting large volumes of data to targets such as Hadoop. Exporting one record at a time is less efficient for large volumes but can be very useful for writing intermittent messages to other services.

In batch mode, the data is exported using a POST or PUT method, where multiple records are combined in either comma-separated value (CSV) or Avro format in the body of the request. When writing one record at a time, you can choose whether to submit the HTTP request as a POST, PUT or GET (that is, as a querystring attached to the URL). When exporting in batch mode, the method must be either POST or PUT and the type must be either csv or avro. When exporting one record at a time, you can use the GET, POST, or PUT method, but the output type must be form.
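
For example, a sketch of a non-batch configuration that sends each record as a GET request might look like the following. The target name and endpoint URL are hypothetical:

<export>
   <configuration enabled="true" target="alerts" type="http">
     <!-- hypothetical endpoint; one record per request -->
     <property name="endpoint">http://msgsvr.example.com/notify</property>
     <property name="batch.mode">false</property>
     <property name="method">get</property>
     <property name="type">form</property>
  </configuration>
</export>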

Finally, the endpoint property specifies the target URL where data is being sent, using either the http: or https: protocol. Again, the endpoint must be compatible with the possible settings for the other properties. In particular, if the endpoint is a WebHDFS URL, batch mode must be enabled.

The URL can also contain placeholders that are filled in at runtime with metadata associated with the export data. Each placeholder consists of a percent sign (%) and a single ASCII character. The following are the valid placeholders for the HTTP endpoint property:

%t
  The name of the VoltDB export source table or stream. The source name is inserted into the endpoint in all uppercase.

%p
  The VoltDB partition ID for the partition where the INSERT query to the export source is executing. The partition ID is an integer value assigned by VoltDB internally and can be used to randomly partition data. For example, when exporting to WebHDFS, the partition ID can be used to direct data to different HDFS files or directories.

%g
  The export generation. The generation is an identifier assigned by VoltDB. The generation increments each time the database starts or the database schema is modified in any way.

%d
  The date and hour of the current export period. Applicable to WebHDFS export only. This placeholder identifies the start of each period and the replacement value remains the same until the period ends, at which point the date and hour is reset for the new period.

  You can use this placeholder to "roll over" WebHDFS export destination files on a regular basis, as defined by the period property. The period property defaults to one hour.

When exporting in batch mode, the endpoint must contain at least one instance each of the %t, %p, and %g placeholders. However, beyond that requirement, it can contain as many placeholders as desired and in any order. When not in batch mode, use of the placeholders is optional.

Table 15.3, “HTTP Export Properties” describes the supported properties for the HTTP connector.

Table 15.3. HTTP Export Properties

endpoint* (string)
  Specifies the target URL. The endpoint can contain placeholders for inserting the source name (%t), the partition ID (%p), the date and hour (%d), and the export generation (%g).

avro.compress (true, false)
  Specifies whether Avro output is compressed or not. The default is false and this property is ignored if the type is not Avro.

avro.schema.location (string)
  Specifies the location where the Avro schema will be written. The schema location can be either an absolute path name on the local database server or a WebHDFS URL and must include at least one instance of the placeholder for the source name (%t). Optionally it can contain other instances of both %t and %g. The default location for the Avro schema is the file path export/avro/%t_avro_schema.json on the database server under the voltdbroot directory. This property is ignored if the type is not Avro.

batch.mode (true, false)
  Specifies whether to send multiple rows as a single request or send each export row separately. The default is true. Batch mode must be enabled for WebHDFS export.

httpfs.enable (true, false)
  Specifies that the target of WebHDFS export is an Apache HttpFS (Hadoop HDFS over HTTP) server. This property must be set to true when exporting WebHDFS to HttpFS targets.

kerberos.enable (true, false)
  Specifies whether Kerberos authentication is used when connecting to a WebHDFS endpoint. This property is only valid when connecting to WebHDFS servers and is false by default.

method (get, post, put)
  Specifies the HTTP method for transmitting the export data. The default method is POST. For WebHDFS export, this property is ignored.

period (integer)
  Specifies the frequency, in hours, for "rolling" the WebHDFS output date and time. The default frequency is every hour (1). For WebHDFS export only.

timezone (string)
  The time zone to use when formatting the timestamp. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is the local time zone.

type (csv, avro, form)
  Specifies the output format. If batch.mode is true, the default type is CSV. If batch.mode is false, the default and only allowable value for type is form. Avro format is supported for WebHDFS export only (see Section 15.3.3.2, “Exporting to Hadoop via WebHDFS” for details).

*Required


15.3.3.2. Exporting to Hadoop via WebHDFS

As mentioned earlier, the HTTP connector contains special optimizations to support exporting data to Hadoop via the WebHDFS protocol. If the endpoint property contains a WebHDFS URL (identified by the URL path component starting with the string "/webhdfs/v1/"), special rules apply.

First, for WebHDFS URLs, the batch.mode property must be enabled. Also, the endpoint must have at least one instance each of the source name (%t), the partition ID (%p), and the export generation (%g) placeholders and those placeholders must be part of the URL path, not the domain or querystring.

Next, the method property is ignored. For WebHDFS, the HTTP connector uses a combination of POST, PUT, and GET requests to perform the necessary operations using the WebHDFS REST API.

For example, the following configuration file excerpt exports stream data to WebHDFS using the HTTP connector, writing each stream to a separate directory with separate files based on the partition ID, generation, and period timestamp, rolling over every two hours:

<export>
   <configuration target="hadoop" enabled="true" type="http">
     <property name="endpoint">
        http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
     </property>
     <property name="batch.mode">true</property>
     <property name="period">2</property>
  </configuration>
</export>

Note that the HTTP connector will create any directories or files in the WebHDFS endpoint path that do not currently exist and then append the data to those files, using the POST or PUT method as appropriate for the WebHDFS REST API.

You also have a choice between two formats for the export data when using WebHDFS: comma-separated values (CSV) and Apache Avro™ format. By default, data is written as CSV data with each record on a separate line and batches of records attached as the contents of the HTTP request. However, you can choose to set the output format to Avro by setting the type property, as in the following example:

<export>
   <configuration target="hadoop" enabled="true" type="http">
     <property name="endpoint">
       http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.avro
     </property>
     <property name="type">avro</property>
     <property name="avro.compress">true</property>
     <property name="avro.schema.location">
       http://myhadoopsvr/webhdfs/v1/%t/schema.json
     </property>
  </configuration>
</export>

Avro is a data serialization system that includes a binary format that is used natively by Hadoop utilities such as Pig and Hive. Because it is a binary format, Avro data takes up less network bandwidth than text-based formats such as CSV. In addition, you can choose to compress the data even further by setting the avro.compress property to true, as in the previous example.

When you select Avro as the output format, VoltDB writes out an accompanying schema definition as a JSON document. For compatibility purposes, the source name and column names are converted, removing underscores and changing the resulting words to lowercase with initial capital letters (sometimes called "camelcase"). The source name is given an initial capital letter, while column names start with a lowercase letter. For example, the stream EMPLOYEE_DATA and its column named EMPLOYEE_ID would be converted to EmployeeData and employeeId in the Avro schema.

By default, the Avro schema is written to a local file on the VoltDB database server. However, you can specify an alternate location, including a webHDFS URL. So, for example, you can store the schema in the same HDFS repository as the data by setting the avro.schema.location property, as shown in the preceding example.

See the Apache Avro web site for more details on the Avro format.

15.3.3.3. Exporting to Hadoop Using Kerberos Security

If the WebHDFS service to which you are exporting data is configured to use Kerberos security, the VoltDB servers must be able to authenticate using Kerberos as well. To do this, you must perform the following two extra steps:

  • Configure Kerberos security for the VoltDB cluster itself

  • Enable Kerberos authentication in the export configuration

The first step is to configure the VoltDB servers to use Kerberos as described in Section 12.9, “Integrating Kerberos Security with VoltDB”. Because use of Kerberos authentication for VoltDB security changes how the clients connect to the database cluster, it is best to set up, enable, and test Kerberos authentication first to ensure your client applications work properly in this environment before trying to enable Kerberos export as well.

Once you have Kerberos authentication working for the VoltDB cluster, you can enable Kerberos authentication in the configuration of the WebHDFS export target as well. Enabling Kerberos authentication in the HTTP connector requires only one additional property, kerberos.enable, which you set to "true". For example:

<export>
   <configuration target="hadoop" enabled="true" type="http">
     <property name="endpoint">
       http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
     </property>
     <property name="type">csv</property>
     <property name="kerberos.enable">true</property>
  </configuration>
</export>

Note that Kerberos authentication is only supported for WebHDFS endpoints.

15.3.4. The JDBC Export Connector

The JDBC connector receives the serialized data from the export source and writes it, in batches, to another database through the standard JDBC (Java Database Connectivity) protocol.

By default, when the JDBC connector opens the connection to the remote database, it first attempts to create tables in the remote database to match the VoltDB export source by executing CREATE TABLE statements through JDBC. This is important to note because it ensures there are suitable tables to receive the exported data. The tables are created using either the names from the VoltDB schema or (if you do not enable the ignoregenerations property) the name prefixed by the database generation ID.

If the target database has existing tables that match the VoltDB export sources in both name and structure (that is, the number, order, and datatype of the columns), be sure to enable the ignoregenerations property in the export configuration to ensure that VoltDB uses those tables as the export target.

It is also important to note that the JDBC connector exports data through JDBC in batches. That is, multiple INSERT instructions are passed to the target database at a time, in approximately two megabyte batches. There are two consequences of the batching of export data:

  • For many databases, such as Netezza, where there is a cost for individual invocations, batching reduces the performance impact on the receiving database and avoids unnecessary latency in the export processing.

  • On the other hand, no matter what the target database, if a query fails for any reason the entire batch fails.

To avoid errors causing batch inserts to fail, it is strongly recommended that the target database not use unique indexes on the receiving tables that might cause constraint violations.

If any errors do occur when the JDBC connector attempts to submit data to the remote database, VoltDB disconnects and then retries the connection. This process is repeated until the connection succeeds. If the connection does not succeed, VoltDB eventually reduces the retry rate to approximately every eight seconds.
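
For example, the following sketch configures the JDBC connector for a PostgreSQL target. The connection URL, credentials, and target name are illustrative; the lowercase property is set because PostgreSQL is case sensitive:

<export>
   <configuration enabled="true" target="archive" type="jdbc">
     <!-- illustrative connection URL and credentials -->
     <property name="jdbcurl">jdbc:postgresql://dbsvr.example.com:5432/archive</property>
     <property name="jdbcuser">voltdb</property>
     <property name="jdbcpassword">mypassword</property>
     <property name="lowercase">true</property>
     <property name="ignoregenerations">true</property>
  </configuration>
</export>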

Table 15.4, “JDBC Export Properties” describes the supported properties for the JDBC connector.

Table 15.4. JDBC Export Properties

jdbcurl* (connection string)
  The JDBC connection string, also known as the URL.

jdbcuser* (string)
  The username for accessing the target database.

jdbcpassword (string)
  The password for accessing the target database.

jdbcdriver (string)
  The class name of the JDBC driver. The JDBC driver class must be accessible to the VoltDB process for the JDBC export process to work. Place the driver JAR files in the lib/extension/ directory where VoltDB is installed to ensure they are accessible at runtime.

  You do not need to specify the driver as a property value for several popular databases, including MySQL, Netezza, Oracle, PostgreSQL, and Vertica. However, you still must provide the driver JAR file.

schema (string)
  The schema name for the target database. The use of the schema name is database specific. In some cases you must specify the database name as the schema. In other cases, the schema name is not needed and the connection string contains all the information necessary. See the documentation for the JDBC driver you are using for more information.

minpoolsize (integer)
  The minimum number of connections in the pool of connections to the target database. The default value is 10.

maxpoolsize (integer)
  The maximum number of connections in the pool. The default value is 100.

maxidletime (integer)
  The number of milliseconds a connection can be idle before it is removed from the pool. The default value is 60000 (one minute).

maxstatementcached (integer)
  The maximum number of statements cached by the connection pool. The default value is 50.

createtable (true, false)
  Specifies whether VoltDB should create the corresponding table in the remote database. By default, VoltDB creates the table(s) to receive the exported data. (That is, the default is true.) If you set this property to false, you must create table(s) with matching names to the VoltDB export sources before starting the export connector.

lowercase (true, false)
  Specifies whether VoltDB uses lowercase table and column names or not. By default, VoltDB issues SQL statements using uppercase names. However, some databases (such as PostgreSQL) are case sensitive. When this property is set to true, VoltDB uses all lowercase names rather than uppercase. The default is false.

ignoregenerations (true, false)
  Specifies whether a unique ID for the generation of the database is included as part of the output table name(s). The generation ID changes each time a database restarts or the database schema is updated. The default is false.

skipinternals (true, false)
  Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported stream data. The default is false.

*Required


15.3.5. The Kafka Export Connector

The Kafka connector receives serialized data from the export sources and writes it to a message queue using the Apache Kafka version 0.10.2 protocols. Apache Kafka is a distributed messaging service that lets you set up message queues which are written to and read from by "producers" and "consumers", respectively. In the Apache Kafka model, VoltDB export acts as a "producer" capable of writing to any Kafka service using version 0.10.2 or later.

Before using the Kafka connector, we strongly recommend reading the Kafka documentation and becoming familiar with the software, since you will need to set up a Kafka service and appropriate "consumer" clients to make use of VoltDB's Kafka export functionality. The instructions in this section assume a working knowledge of Kafka and the Kafka operational model.

When the Kafka connector receives data from the VoltDB export sources, it establishes a connection to the Kafka messaging service as a Kafka producer. It then writes records to Kafka topics based on the VoltDB stream or table name and certain export connector properties.

The majority of the Kafka export properties are identical in both name and content to the Kafka producer properties listed in the Kafka documentation. All but one of these properties are optional for the Kafka connector and will use the standard Kafka default value. For example, if you do not specify the queue.buffering.max.ms property it defaults to 5000 milliseconds.

The only required property is bootstrap.servers, which lists the Kafka servers that the VoltDB export connector should connect to. You must include this property so VoltDB knows where to send the export data. Specify each server by its IP address (or hostname) and port; for example, myserver:7777. If there are multiple servers in the list, separate them with commas.

In addition to the standard Kafka producer properties, there are several custom properties specific to VoltDB. The properties binaryencoding, skipinternals, and timezone affect the format of the data. The topic.prefix and topic.key properties affect how the data is written to Kafka.

The topic.prefix property specifies the text that precedes the stream or table name when constructing the Kafka topic. If you do not specify a prefix, it defaults to "voltdbexport". Alternately, you can map individual sources to topics using the topic.key property. In the topic.key property you associate a VoltDB export source name with the corresponding Kafka topic as a named pair separated by a period (.). Multiple named pairs are separated by commas (,). For example:

Employee.EmpTopic,Company.CoTopic,Enterprise.EntTopic

Any mappings in the topic.key property override the automated topic name specified by topic.prefix.

Note that unless you configure the Kafka brokers with the auto.create.topics.enable property set to true, you must create the topics for every export source manually before starting the export process. Enabling auto-creation of topics when setting up the Kafka brokers is recommended.

When configuring the Kafka export connector, it is important to understand the relationship between synchronous versus asynchronous processing and its effect on database latency. If the export data is sent asynchronously, the impact of export on the database is reduced, since the export connector does not wait for the Kafka infrastructure to respond. However, with asynchronous processing, VoltDB is not able to resend the data if the message fails after it is sent.

If export to Kafka is done synchronously, the export connector waits for acknowledgement of each message sent to Kafka before processing the next packet. This allows the connector to resend any packets that fail. The drawback to synchronous processing is that on a heavily loaded database, the latency it introduces means export may not be able to keep up with the influx of export data and may have to write to overflow.

You specify the level of synchronicity and durability of the connection using the Kafka acks property. Set acks to "0" for asynchronous processing, "1" for synchronous delivery to the Kafka broker, or "all" to ensure durability on the Kafka broker. See the Kafka documentation for more information.

VoltDB guarantees that at least one copy of all export data is sent by the export connector. But when operating in asynchronous mode, the Kafka connector cannot guarantee that the packet is actually received and accepted by the Kafka broker. By operating in synchronous mode, VoltDB can catch errors returned by the Kafka broker and resend any failed packets. However, you pay the penalty of additional latency and possible export overflow.

Finally, the actual export data is sent to Kafka as a comma-separated values (CSV) formatted string. The message includes six columns of metadata (such as the transaction ID and timestamp) followed by the column values of the export stream.
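
For example, the following sketch configures Kafka export with synchronous delivery and a custom topic prefix. The broker addresses, target name, and prefix are illustrative:

<export>
   <configuration enabled="true" target="analytics" type="kafka">
     <!-- illustrative broker list and topic prefix -->
     <property name="bootstrap.servers">kafkasvr1:9092,kafkasvr2:9092</property>
     <property name="topic.prefix">voltdb_</property>
     <property name="acks">all</property>
     <property name="skipinternals">true</property>
  </configuration>
</export>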

Table 15.5, “Kafka Export Properties” lists the supported properties for the Kafka connector, including the standard Kafka producer properties and the VoltDB unique properties.

Table 15.5. Kafka Export Properties

bootstrap.servers* (string)
  A comma-separated list of Kafka brokers (specified as IP-address:port-number). You can use metadata.broker.list as a synonym for bootstrap.servers.

acks (0, 1, all)
  Specifies whether export is synchronous (1 or all) or asynchronous (0) and to what extent it ensures delivery. The default is all, which is recommended to avoid possibly losing messages when a Kafka server becomes unavailable during export. See the Kafka documentation of the producer properties for details.

acks.retry.timeout (integer)
  Specifies how long, in milliseconds, the connector will wait for acknowledgment from Kafka for each packet. The retry timeout only applies if acknowledgements are enabled; that is, if the acks property is not set to 0. The default timeout is 5,000 milliseconds. When the timeout is reached, the connector will resend the packet of messages.

binaryencoding (hex, base64)
  Specifies whether VARBINARY data is encoded in hexadecimal or BASE64 format. The default is hexadecimal.

skipinternals (true, false)
  Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported stream data. The default is false.

timezone (string)
  The time zone to use when formatting the timestamp. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is GMT.

topic.key (string)
  A set of named pairs each identifying a VoltDB source name and the corresponding Kafka topic name to which the data is written. Separate the names with a period (.) and the name pairs with a comma (,).

  The specific source/topic mappings declared by topic.key override the automated topic names specified by topic.prefix.

topic.prefix (string)
  The prefix to use when constructing the topic name. Each row is sent to a topic identified by {prefix}{source-name}. The default prefix is "voltdbexport".

Kafka producer properties (various)
  You can specify standard Kafka producer properties as export connector properties and their values will be passed to the Kafka interface. However, you cannot modify the property block.on.buffer.full.

*Required


15.3.6. The Elasticsearch Export Connector

The Elasticsearch connector receives serialized data from the export source and inserts it into an Elasticsearch server or cluster. Elasticsearch is an open-source full-text search engine built on top of Apache Lucene™. By exporting selected tables and streams from your VoltDB database to Elasticsearch you can perform extensive full-text searches on the data not possible with VoltDB alone.

Before using the Elasticsearch connector, we recommend reading the Elasticsearch documentation and becoming familiar with the software. The instructions in this section assume a working knowledge of Elasticsearch, its configuration and its capabilities.

The only required property when configuring Elasticsearch is the endpoint, which identifies the location of the Elasticsearch server and what index to use when inserting records into the target system. The structure of the Elasticsearch endpoint is the following:

<protocol>://<server>:<port>/<index-name>/<type-name>

For example, if the target Elasticsearch service is on the server esearch.lan using the default port (9200) and the exported records are being inserted into the employees index as documents of type person, the endpoint declaration would look like this:

<property name="endpoint">
   http://esearch.lan:9200/employees/person
</property>

You can use placeholders in the endpoint that are replaced at runtime with information from the export data, such as the source name (%t), the partition ID (%p), the export generation (%g), and the date and hour (%d). For example, to use the source name as the index name, the endpoint might look like the following:

<property name="endpoint">
   http://esearch.lan:9200/%t/person
</property>
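
Putting the pieces together, a complete configuration for this connector might look like the following sketch, where the target name and server address are examples:

<export>
   <configuration enabled="true" target="search" type="elasticsearch">
     <!-- example server, index placeholder, and type -->
     <property name="endpoint">http://esearch.lan:9200/%t/person</property>
     <property name="batch.mode">true</property>
  </configuration>
</export>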

When you export to Elasticsearch, the export connector creates the necessary index names and types in Elasticsearch (if they do not already exist) and inserts each record as a separate document with the appropriate metadata. Table 15.6, “Elasticsearch Export Properties” lists the supported properties for the Elasticsearch connector.

Table 15.6. Elasticsearch Export Properties

endpoint* (string)
  Specifies the root URL of the RESTful interface for the Elasticsearch cluster to which you want to export the data. The endpoint should include the protocol, host name or IP address, port, and path. The path is assumed to include an index name and a type identifier.

  The export connector will use the Elasticsearch RESTful API to communicate with the server and insert records into the specified locations. You can use placeholders to replace portions of the endpoint with data from the exported records at runtime, including the source name (%t), the partition ID (%p), the date and hour (%d), and the export generation (%g).

batch.mode (true, false)
  Specifies whether to send multiple rows as a single request or send each export row separately. The default is true.

timezone (string)
  The time zone to use when formatting timestamps. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is the local time zone.

*Required




[5] Initializing a root directory deletes any files in the command log and overflow directories. The snapshots directory is archived to a named subdirectory.