16.4. VoltDB Import Connectors

Just as VoltDB can export data from selected streams and tables to external targets, it supports importing data into selected tables from external sources. Import works in two ways:

Bulk loading data using one of several standalone utilities VoltDB provides. These data loaders support multiple standard input protocols and can be run from any server, even remotely from the database itself.
Streaming import as part of the database server process. For data that is imported on an ongoing basis, use of the built-in import functionality ensures that import starts and stops with the database.

The following sections discuss these two approaches to data import.

16.4.1. Bulk Loading Data Using VoltDB Standalone Utilities

Often, when migrating data from one database to another or when pre-loading a set of data into VoltDB as a starting point, you just want to perform the import once and then use the data natively within VoltDB. For these one-time uses, or when you prefer to manage the import process externally, VoltDB provides separate data loader utilities.

Each data loader supports a different source format. You can load data from text files — such as comma-separated value (CSV) files — using the csvloader utility. You can load data from another JDBC-compliant database using the jdbcloader utility. Or you can load data from a streaming message service with the Kafka loader utility, kafkaloader.

All of the data loaders operate in much the same way. For each utility you specify the source for the import and either a table that the data will be loaded into or a stored procedure that will be used to load the data. So, for example, to load records from a CSV file named staff.csv into the table EMPLOYEES, the command might be the following:

$ csvloader employees --file=staff.csv

If instead you are copying the data from a JDBC-compliant database, the command might look like this:

$ jdbcloader employees \
     --jdbcurl=jdbc:postgresql://remotesvr/corphr \
     --jdbctable=employees \
     --jdbcdriver=org.postgresql.Driver

Each utility has arguments unique to the data source (such as --jdbcurl) that allow you to properly configure and connect to the source. See the description of each utility in Appendix D, VoltDB CLI Commands for details.

16.4.2. Streaming Import Using Built-in Import Features

If importing data is an ongoing business process, rather than a one-time event, then it is desirable to make it an integral part of the database system. This can be done by building a custom application to push data into VoltDB using one of its standard APIs, such as the JDBC interface. Or you can take advantage of VoltDB's built-in import infrastructure.

The built-in importers work in much the same way as the data loading utilities, where incoming data is written into one or more database tables using an existing stored procedure. The difference is that the built-in importers start automatically whenever the database starts and stop when the database stops, making import an integral part of the database process.

You set up the built-in importers in the configuration the same way you configure export connections. Within the import property, you declare each import stream using separate configurations list elements. You use subproperties to specify the type and format of data being imported and whether the import configuration is enabled or not. Provide information required by the specific importer and/or formatter within the properties list. For example:

deployment:
  import:
    configurations:
      type: kafka
      format: csv
      enabled: true
      properties:
      - name: brokers
        value: kafkasvr:9092
      - name: topics
        value: employees
      - name: procedure
        value: EMPLOYEE.insert

When the database starts, the import infrastructure starts any enabled configurations. If you are importing multiple streams to separate tables through separate procedures, you must include multiple configurations, even if they come from the same source. For example, the following configuration imports data from two Kafka topics from the same Kafka servers into separate VoltDB tables.

deployment:
  import:
    configurations:
    - nickname: employees
      type: kafka
      format: csv
      enabled: true
      properties:
      - name: brokers
        value: kafkasvr:9092
      - name: topics
        value: employees
      - name: procedure
        value: EMPLOYEE.insert
    - nickname: managers
      type: kafka
      format: csv
      enabled: true
      properties:
      - name: brokers
        value: kafkasvr:9092
      - name: topics
        value: managers
      - name: procedure
        value: MANAGER.insert

Note the addition of the nickname property to the configuration. Unlike export where the target property provides a unique identifier, there are no unique subproperties in the import configuration. The nickname property lets you assign a unique name to each import connector. This way you can modify the settings for individual connectors while the database is running using the voltadmin set command. For example, the following command disables the managers import connector:

$ voltadmin set deployment.import.configurations[managers].enabled=false

Without the nickname, you would have to update the entire database configuration to modify just one import connector.

VoltDB currently provides support for a Kafka importer. It also supports two import formats: comma-separated values (csv) and tab-separated values (tsv). Comma-separated values are the default format. So if you are using CSV-formatted input, you can leave out the format attribute, as in the preceding example. The following sections describe the Kafka importer and the CSV/TSV formatter in more detail.

16.4.3. The Kafka Importer

The Kafka importer connects to the specified Kafka messaging service and imports one or more Kafka topics and writes the records into the VoltDB database. The data is decoded according to the specified format — comma-separated values (CSV) by default — and is inserted into the VoltDB database using the specified stored procedure.

The Kafka importer supports Kafka version 0.10 or later. You must specify at least the following properties for each configuration:

brokers — Identifies one or more Kafka brokers. That is, servers hosting the Kafka service and desired topics. Specify a single server or a comma-separated list of brokers.
topics — Identifies the Kafka topics that will be imported. The property value can be a single topic name or a comma-separated list of topics.
procedure — Identifies the stored procedure that is invoked to insert the records into the VoltDB database.

When import starts, the importer first checks to make sure the specified stored procedure exists in the database schema. If not (for example, when you first create a database and before a schema is loaded), the importer issues periodic warnings to the console.

Once the specified stored procedure is declared, the importer looks for the specified Kafka brokers and topics. If the specified brokers cannot be found or the specified topics do not exist on the brokers, the importer reports an error and stops. You will need to restart import once this error condition is corrected. You can restart import using any of the following methods:

Stop and restart the database
Pause and resume the database using the voltadmin pause and voltadmin resume commands
Update the configuration using the voltadmin update command or the web-based Volt Management Center

If the brokers are found and the topics exist, the importer starts fetching data from the Kafka topics and submitting it to the stored procedure to insert into the database. In the simplest case, you can use the default insert procedure for a table to insert records into a single table. For more complex data you can write your own import stored procedure to interpret the data and insert it into the appropriate table(s).

Table 16.7, “Kafka Import Properties” lists the allowable properties for the Kafka importer. You can also specify properties associated with the formatter, as described in Table 16.8, “CSV and TSV Formatter Properties”.

Table 16.7. Kafka Import Properties

Property	Allowable Values	Description
brokers^*	string	A comma-separated list of Kafka brokers.
procedure^*	string	The stored procedure to invoke to insert the incoming data into the database.
topics^*	string	A comma-separated list of Kafka topics.
commit.policy	integer	Because the importer performs two distinct tasks — retrieving records from Kafka and then inserting them into VoltDB — Kafka's automated tracking of the current offset may not match what records are successfully inserted into the database. Therefore, by default, the importer uses a manual commit policy to ensure the Kafka offset matches the completed inserts. Use of the default commit policy is recommended. However, you can, if you choose, use Kafka's automated commit policy by specifying a commit interval, in milliseconds, using this property.
groupid	string	A user-defined name for the group that the client belongs to. Kafka maintains a single pointer for the current position within the stream for all clients in the same group. The default group ID is "voltdb". In the rare case where you have two or more databases importing data from the same Kafka brokers and topics, be sure to set this property to give each database a unique group ID and avoid the databases interfering with each other.
fetch.max.bytes heartbeat.interval.ms max.partition.fetch.bytes max.poll.interval.ms max.poll.records request.timeout.ms session.timeout.ms	string	These Kafka consumer properties are supported as import properties. See the Kafka 0.11 documentation for details.
^*Required