15.7. The HTTP Export Connector

The HTTP connector receives the serialized data from the export streams and writes it out via HTTP requests. The connector is designed to be flexible enough to accommodate most potential targets. For example, the connector can be configured to send out individual records using a GET request or batch multiple records using POST and PUT requests. The connector also contains optimizations to support export to Hadoop via WebHDFS.

15.7.1. Understanding HTTP Properties

The HTTP connector is a general-purpose export utility that can export to any number of destinations, from simple messaging services to more complex REST APIs. The properties work together to create a consistent export process, so it is important to understand how they interact in order to configure your export correctly. The four key properties you need to consider are:

  • batch.mode — whether data is exported in batches or one record at a time

  • method — the HTTP request method used to transmit the data

  • type — the format of the output

  • endpoint — the target HTTP URL to which export is written

The properties are described in detail in Table 15.2, “HTTP Export Properties”. This section explains the relationship between the properties.

There are essentially two types of HTTP export: batch mode and one record at a time. Batch mode is appropriate for exporting large volumes of data to targets such as Hadoop. Exporting one record at a time is less efficient for large volumes but can be very useful for writing intermittent messages to other services.

In batch mode, the data is exported using a POST or PUT method, where multiple records are combined in either comma-separated value (CSV) or Avro format in the body of the request. When writing one record at a time, you can choose whether to submit the HTTP request as a POST, PUT or GET (that is, as a querystring attached to the URL). When exporting in batch mode, the method must be either POST or PUT and the type must be either csv or avro. When exporting one record at a time, you can use the GET, POST, or PUT method, but the output type must be form.
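
For example, a configuration like the following sketch sends each export row individually as an HTTP GET request with the column values encoded as a form querystring. The target name and endpoint URL here are hypothetical placeholders for your own service:

<export>
   <configuration target="alerts" enabled="true" type="http">
     <property name="endpoint">http://mymsgsvr/api/notify</property>
     <property name="batch.mode">false</property>
     <property name="method">get</property>
     <property name="type">form</property>
  </configuration>
</export>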

Finally, the endpoint property specifies the target URL where data is being sent, using either the http: or https: protocol. Again, the endpoint must be compatible with the possible settings for the other properties. In particular, if the endpoint is a WebHDFS URL, batch mode must be enabled.

The URL can also contain placeholders that are filled in at runtime with metadata associated with the export data. Each placeholder consists of a percent sign (%) and a single ASCII character. The following are the valid placeholders for the HTTP endpoint property:

%t
   The name of the VoltDB export stream. The stream name is inserted into the endpoint in all uppercase.

%p
   The VoltDB partition ID for the partition where the INSERT query to the export stream is executing. The partition ID is an integer value assigned by VoltDB internally and can be used to randomly partition data. For example, when exporting to WebHDFS, the partition ID can be used to direct data to different HDFS files or directories.

%g
   The export generation. The generation is an identifier assigned by VoltDB. The generation increments each time the database starts or the database schema is modified in any way.

%d
   The date and hour of the current export period. Applicable to WebHDFS export only. This placeholder identifies the start of each period and the replacement value remains the same until the period ends, at which point the date and hour is reset for the new period.

   You can use this placeholder to "roll over" WebHDFS export destination files on a regular basis, as defined by the period property. The period property defaults to one hour.

When exporting in batch mode, the endpoint must contain at least one instance each of the %t, %p, and %g placeholders. Beyond that requirement, it can contain as many placeholders as desired, in any order. When not in batch mode, use of the placeholders is optional.
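
For example, the following endpoint value (using a hypothetical host and path) satisfies the batch-mode requirement by including %t, %p, and %g in the URL path:

<property name="endpoint">
   http://myexportsvr:8080/export/%t/%g/batch-%p.csv
</property>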

Table 15.2, “HTTP Export Properties” describes the supported properties for the HTTP connector.

Table 15.2. HTTP Export Properties

endpoint* (string)
   Specifies the target URL. The endpoint can contain placeholders for inserting the stream name (%t), the partition ID (%p), the date and hour (%d), and the export generation (%g).

avro.compress (true, false)
   Specifies whether Avro output is compressed. The default is false and this property is ignored if the type is not Avro.

avro.schema.location (string)
   Specifies the location where the Avro schema will be written. The schema location can be either an absolute path name on the local database server or a WebHDFS URL and must include at least one instance of the placeholder for the stream name (%t). Optionally, it can contain other instances of both %t and %g. The default location for the Avro schema is the file path export/avro/%t_avro_schema.json on the database server under the voltdbroot directory. This property is ignored if the type is not Avro.

batch.mode (true, false)
   Specifies whether to send multiple rows as a single request or send each export row separately. The default is true. Batch mode must be enabled for WebHDFS export.

httpfs.enable (true, false)
   Specifies that the target of WebHDFS export is an Apache HttpFS (Hadoop HDFS over HTTP) server. This property must be set to true when exporting via WebHDFS to HttpFS targets.

kerberos.enable (true, false)
   Specifies whether Kerberos authentication is used when connecting to a WebHDFS endpoint. This property is only valid when connecting to WebHDFS servers and is false by default.

method (get, post, put)
   Specifies the HTTP method for transmitting the export data. The default method is POST. For WebHDFS export, this property is ignored.

period (integer)
   Specifies the frequency, in hours, for "rolling" the WebHDFS output date and time. The default frequency is every hour (1). For WebHDFS export only.

timezone (string)
   The time zone to use when formatting the timestamp. Specify the time zone as a Java timezone identifier. The default is the local time zone.

type (csv, avro, form)
   Specifies the output format. If batch.mode is true, the default type is csv. If batch.mode is false, the default and only allowable value for type is form. Avro format is supported for WebHDFS export only (see Section 15.7.2, “Exporting to Hadoop via WebHDFS” for details).

*Required


15.7.2. Exporting to Hadoop via WebHDFS

As mentioned earlier, the HTTP connector contains special optimizations to support exporting data to Hadoop via the WebHDFS protocol. If the endpoint property contains a WebHDFS URL (identified by the URL path component starting with the string "/webhdfs/v1/"), special rules apply.

First, for WebHDFS URLs, the batch.mode property must be enabled. Also, the endpoint must have at least one instance each of the stream name (%t), the partition ID (%p), and the export generation (%g) placeholders and those placeholders must be part of the URL path, not the domain or querystring.

Next, the method property is ignored. For WebHDFS, the HTTP connector uses a combination of POST, PUT, and GET requests to perform the necessary operations using the WebHDFS REST API.

For example, the following configuration file excerpt exports stream data to WebHDFS using the HTTP connector, writing each stream to a separate directory with separate files based on the partition ID, generation, and period timestamp, rolling over every two hours:

<export>
   <configuration target="hadoop" enabled="true" type="http">
     <property name="endpoint">
        http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
     </property>
     <property name="batch.mode">true</property>
     <property name="period">2</property>
  </configuration>
</export>

Note that the HTTP connector will create any directories or files in the WebHDFS endpoint path that do not currently exist and then append the data to those files, using the POST or PUT method as appropriate for the WebHDFS REST API.
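
If it helps to see what those operations look like at the protocol level, the following Java sketch performs the same style of create-then-append calls against the WebHDFS REST API. It is an illustration under assumed values (the host, port, file path, and user.name parameter are hypothetical), not the connector's actual code:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebHdfsAppendSketch {
    // Illustrative sketch of the two-step WebHDFS REST exchange: the namenode
    // answers with a redirect to a datanode, and the data is then sent to
    // that datanode URL.
    static void send(String url, String method, byte[] body) throws Exception {
        HttpURLConnection nn = (HttpURLConnection) new URL(url).openConnection();
        nn.setRequestMethod(method);
        nn.setInstanceFollowRedirects(false);          // we want the Location header
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod(method);
        dn.setDoOutput(true);
        try (OutputStream out = dn.getOutputStream()) {
            out.write(body);                           // the CSV batch itself
        }
        System.out.println(method + " " + url + " -> HTTP " + dn.getResponseCode());
        dn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        byte[] batch = "1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8);
        String file = "http://myhadoopsvr:50070/webhdfs/v1/EMPLOYEE_DATA/data0-1.csv";
        send(file + "?op=CREATE&overwrite=false&user.name=voltdb", "PUT", batch);  // create the file
        send(file + "?op=APPEND&user.name=voltdb", "POST", batch);                 // append later batches
    }
}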

You also have a choice between two formats for the export data when using WebHDFS: comma-separated values (CSV) and Apache Avro™ format. By default, data is written as CSV data with each record on a separate line and batches of records attached as the contents of the HTTP request. However, you can choose to set the output format to Avro by setting the type property, as in the following example:

<export>
   <configuration target="hadoop" enabled="true" type="http">
     <property name="endpoint">
       http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.avro
     </property>
     <property name="type">avro</property>
     <property name="avro.compress">true</property>
     <property name="avro.schema.location">
       http://myhadoopsvr/webhdfs/v1/%t/schema.json
     </property>
  </configuration>
</export>

Avro is a data serialization system that includes a binary format that is used natively by Hadoop utilities such as Pig and Hive. Because it is a binary format, Avro data takes up less network bandwidth than text-based formats such as CSV. In addition, you can choose to compress the data even further by setting the avro.compress property to true, as in the previous example.

When you select Avro as the output format, VoltDB writes out an accompanying schema definition as a JSON document. For compatibility purposes, the stream name and column names are converted, removing underscores and changing the resulting words to lowercase with initial capital letters (sometimes called "camelcase"). The stream name is given an initial capital letter, while column names start with a lowercase letter. For example, the stream EMPLOYEE_DATA and its column named EMPLOYEE_ID would be converted to EmployeeData and employeeId in the Avro schema.
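
The following Java sketch illustrates that naming rule. It is a simplified approximation based only on the behavior described above, not the connector's own conversion code:

public class AvroNameSketch {
    // Convert an underscore-separated SQL name to the camelcase form used in
    // the Avro schema. If initialCap is true the first word is also
    // capitalized (stream names); otherwise it stays lowercase (column names).
    static String toAvroName(String sqlName, boolean initialCap) {
        StringBuilder out = new StringBuilder();
        for (String word : sqlName.toLowerCase().split("_")) {
            if (word.isEmpty()) continue;
            if (out.length() == 0 && !initialCap) {
                out.append(word);
            } else {
                out.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAvroName("EMPLOYEE_DATA", true));  // EmployeeData
        System.out.println(toAvroName("EMPLOYEE_ID", false));   // employeeId
    }
}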

By default, the Avro schema is written to a local file on the VoltDB database server. However, you can specify an alternate location, including a WebHDFS URL. So, for example, you can store the schema in the same HDFS repository as the data by setting the avro.schema.location property, as shown in the earlier WebHDFS export example.

See the Apache Avro web site for more details on the Avro format.

15.7.3. Exporting to Hadoop Using Kerberos Security

If the WebHDFS service to which you are exporting data is configured to use Kerberos security, the VoltDB servers must be able to authenticate using Kerberos as well. To do this, you must perform the following two extra steps:

  • Configure Kerberos security for the VoltDB cluster itself

  • Enable Kerberos authentication in the export configuration

The first step is to configure the VoltDB servers to use Kerberos as described in Section 12.8, “Integrating Kerberos Security with VoltDB”. Because use of Kerberos authentication for VoltDB security changes how clients connect to the database cluster, it is best to set up, enable, and test Kerberos authentication first to ensure your client applications work properly in this environment before trying to enable Kerberos export as well.

Once you have Kerberos authentication working for the VoltDB cluster, you can enable Kerberos authentication in the configuration of the WebHDFS export target as well. Enabling Kerberos authentication in the HTTP connector requires only one additional property, kerberos.enable, which you set to "true". For example:

<export>
   <configuration target="hadoop" enabled="true" type="http">
     <property name="endpoint">
       http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
     </property>
     <property name="type">csv</property>
     <property name="kerberos.enable">true</property>
  </configuration>
</export>

Note that Kerberos authentication is only supported for WebHDFS endpoints.