The HTTP connector receives the serialized data from the export streams and writes it out via HTTP requests. The connector is designed to be flexible enough to accommodate most potential targets. For example, the connector can be configured to send out individual records using a GET request or batch multiple records using POST and PUT requests. The connector also contains optimizations to support export to Hadoop via WebHDFS.
The HTTP connector is a general purpose export utility that can export to any number of destinations, from simple messaging services to more complex REST APIs. To configure your export correctly, it is important to understand how the connector properties interact. The four key properties you need to consider are:
batch.mode — whether data is exported in batches or one record at a time
method — the HTTP request method used to transmit the data
type — the format of the output
endpoint — the target HTTP URL to which export is written
The properties are described in detail in Table 15.2, “HTTP Export Properties”. This section explains the relationship between the properties.
There are essentially two types of HTTP export: batch mode and one record at a time. Batch mode is appropriate for exporting large volumes of data to targets such as Hadoop. Exporting one record at a time is less efficient for large volumes but can be very useful for writing intermittent messages to other services.
In batch mode, the data is exported using a POST or PUT method, where multiple records are combined in either comma-separated value (CSV) or Avro format in the body of the request. When writing one record at a time, you can choose whether to submit the HTTP request as a POST, PUT, or GET (that is, as a querystring attached to the URL). When exporting in batch mode, the method must be either POST or PUT and the type must be either csv or avro. When exporting one record at a time, you can use the GET, POST, or PUT method, but the output type must be form.
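For example, a minimal one-record-at-a-time configuration might disable batch mode and submit each record as a querystring using the GET method. The following excerpt is an illustrative sketch only; the target name and endpoint URL are hypothetical, and the properties are those described in Table 15.2, “HTTP Export Properties”:
<export>
<configuration target="alerts" enabled="true" type="http">
<property name="endpoint">http://alerts.example.com/notify</property>
<property name="batch.mode">false</property>
<property name="method">get</property>
<property name="type">form</property>
</configuration>
</export>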
Finally, the endpoint property specifies the target URL where data is being sent, using either the http: or https: protocol. Again, the endpoint must be compatible with the possible settings for the other properties. In particular, if the endpoint is a WebHDFS URL, batch mode must be enabled.
The URL can also contain placeholders that are filled in at runtime with metadata associated with the export data. Each placeholder consists of a percent sign (%) and a single ASCII character. The following are the valid placeholders for the HTTP endpoint property:
Placeholder | Description |
---|---|
%t | The name of the VoltDB export stream. The stream name is inserted into the endpoint in all uppercase. |
%p | The VoltDB partition ID for the partition where the INSERT query to the export stream is executing. The partition ID is an integer value assigned by VoltDB internally and can be used to randomly partition data. For example, when exporting to WebHDFS, the partition ID can be used to direct data to different HDFS files or directories. |
%g | The export generation. The generation is an identifier assigned by VoltDB. The generation increments each time the database starts or the database schema is modified in any way. |
%d | The date and hour of the current export period. Applicable to WebHDFS export only. This placeholder identifies the start of each period and the replacement value remains the same until the period ends, at which point the date and hour is reset for the new period. You can use this placeholder to "roll over" WebHDFS export destination files on a regular basis, as defined by the period property. |
When exporting in batch mode, the endpoint must contain at least one instance each of the %t, %p, and %g placeholders. However, beyond that requirement, it can contain as many placeholders as desired and in any order. When not in batch mode, use of the placeholders is optional.
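To make the substitution concrete, consider the endpoint property below, which uses the same WebHDFS-style URL as the later examples. For a hypothetical stream named ALERTS writing from partition 3 in generation 2, the connector replaces %t with ALERTS (uppercased), %p with 3, %g with 2, and %d with the date and hour of the current export period, producing a path of the form /webhdfs/v1/ALERTS/data3-2.<date-and-hour>.csv:
<property name="endpoint">
http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
</property>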
Table 15.2, “HTTP Export Properties” describes the supported properties for the HTTP connector.
Table 15.2. HTTP Export Properties
Property | Allowable Values | Description |
---|---|---|
endpoint* | string | Specifies the target URL. The endpoint can contain placeholders for inserting the stream name (%t), the partition ID (%p), the date and hour (%d), and the export generation (%g). |
avro.compress | true, false | Specifies whether Avro output is compressed or not. The default is false and this property is ignored if the type is not Avro. |
avro.schema.location | string | Specifies the location where the Avro schema will be written. The schema location can be either an absolute path name on the local database server or a WebHDFS URL and must include at least one instance of the placeholder for the stream name (%t). Optionally it can contain other instances of both %t and %g. The default location for the Avro schema is the file path export/avro/%t_avro_schema.json on the database server under the voltdbroot directory. This property is ignored if the type is not Avro. |
batch.mode | true, false | Specifies whether to send multiple rows as a single request or send each export row separately. The default is true. Batch mode must be enabled for WebHDFS export. |
httpfs.enable | true, false | Specifies that the target of WebHDFS export is an Apache HttpFS (Hadoop HDFS over HTTP) server. This property must be set to true when exporting via WebHDFS to HttpFS targets. |
kerberos.enable | true, false | Specifies whether Kerberos authentication is used when connecting to a WebHDFS endpoint. This property is only valid when connecting to WebHDFS servers and is false by default. |
method | get, post, put | Specifies the HTTP method for transmitting the export data. The default method is POST. For WebHDFS export, this property is ignored. |
period | Integer | Specifies the frequency, in hours, for "rolling" the WebHDFS output date and time. The default frequency is every hour (1). For WebHDFS export only. |
timezone | string | The time zone to use when formatting the timestamp. Specify the time zone as a Java timezone identifier. The default is the local time zone. |
type | csv, avro, form | Specifies the output format. If batch.mode is true, the default type is csv. If batch.mode is false, the default and only allowable value for type is form. Avro format is supported for WebHDFS export only (see Section 15.7.2, “Exporting to Hadoop via WebHDFS” for details). |
*Required |
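For instance, because batch.mode defaults to true, method defaults to POST, and the type for batch exports defaults to csv, a batch-mode export to a generic HTTP endpoint can be configured with little more than the endpoint itself. The target name and URL below are hypothetical; note that batch mode requires the %t, %p, and %g placeholders in the endpoint:
<export>
<configuration target="weblog" enabled="true" type="http">
<property name="endpoint">
http://collector.example.com/ingest/%t/data%p-%g.csv
</property>
</configuration>
</export>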
As mentioned earlier, the HTTP connector contains special optimizations to support exporting data to Hadoop via the WebHDFS protocol. If the endpoint property contains a WebHDFS URL (identified by the URL path component starting with the string "/webhdfs/v1/"), special rules apply.
First, for WebHDFS URLs, the batch.mode property must be enabled. Also, the endpoint must have at least one instance each of the stream name (%t), the partition ID (%p), and the export generation (%g) placeholders and those placeholders must be part of the URL path, not the domain or querystring.
Next, the method property is ignored. For WebHDFS, the HTTP connector uses a combination of POST, PUT, and GET requests to perform the necessary operations using the WebHDFS REST API.
For example, the following configuration file excerpt exports stream data to WebHDFS using the HTTP connector, writing each stream to a separate directory with separate files based on the partition ID, generation, and period timestamp, rolling over every two hours:
<export>
<configuration target="hadoop" enabled="true" type="http">
<property name="endpoint">
http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
</property>
<property name="batch.mode">true</property>
<property name="period">2</property>
</configuration>
</export>
Note that the HTTP connector will create any directories or files in the WebHDFS endpoint path that do not currently exist and then append the data to those files, using the POST or PUT method as appropriate for the WebHDFS REST API.
You also have a choice between two formats for the export data when using WebHDFS: comma-separated values (CSV) and Apache Avro™ format. By default, data is written as CSV with each record on a separate line and batches of records attached as the contents of the HTTP request. However, you can choose to set the output format to Avro by setting the type property, as in the following example:
<export>
<configuration target="hadoop" enabled="true" type="http">
<property name="endpoint">
http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.avro
</property>
<property name="type">avro</property>
<property name="avro.compress">true</property>
<property name="avro.schema.location">
http://myhadoopsvr/webhdfs/v1/%t/schema.json
</property>
</configuration>
</export>
Avro is a data serialization system that includes a binary format used natively by Hadoop utilities such as Pig and Hive. Because it is a binary format, Avro data takes up less network bandwidth than text-based formats such as CSV. In addition, you can choose to compress the data even further by setting the avro.compress property to true, as in the previous example.
When you select Avro as the output format, VoltDB writes out an accompanying schema definition as a JSON document. For compatibility purposes, the stream name and column names are converted, removing underscores and changing the resulting words to lowercase with initial capital letters (sometimes called "camelcase"). The stream name is given an initial capital letter, while column names start with a lowercase letter. For example, the stream EMPLOYEE_DATA and its column named EMPLOYEE_ID would be converted to EmployeeData and employeeId in the Avro schema.
By default, the Avro schema is written to a local file on the VoltDB database server. However, you can specify an alternate location, including a WebHDFS URL. So, for example, you can store the schema in the same HDFS repository as the data by setting the avro.schema.location property, as shown in the preceding example.
See the Apache Avro web site for more details on the Avro format.
If the WebHDFS service to which you are exporting data is configured to use Kerberos security, the VoltDB servers must be able to authenticate using Kerberos as well. To do this, you must perform the following two extra steps:
Configure Kerberos security for the VoltDB cluster itself
Enable Kerberos authentication in the export configuration
The first step is to configure the VoltDB servers to use Kerberos as described in Section 12.8, “Integrating Kerberos Security with VoltDB”. Because the use of Kerberos authentication for VoltDB security changes how clients connect to the database cluster, it is best to set up, enable, and test Kerberos authentication first to ensure your client applications work properly in this environment before trying to enable Kerberos export as well.
Once you have Kerberos authentication working for the VoltDB cluster, you can enable Kerberos authentication in the configuration of the WebHDFS export target as well. Enabling Kerberos authentication in the HTTP connector only requires one additional property, kerberos.enable, to be set. To use Kerberos authentication, set the property to "true". For example:
<export>
<configuration target="hadoop" enabled="true" type="http">
<property name="endpoint">
http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv
</property>
<property name="type">csv</property>
<property name="kerberos.enable">true</property>
</configuration>
</export>
Note that Kerberos authentication is only supported for WebHDFS endpoints.