VoltDB and Machine Learning

V1.0

October, 2018


1. Introduction

The VoltDB Machine Learning utility simplifies the process of implementing machine learning models in VoltDB. The initial offering includes the model loader, voltml, which converts a machine learning model into executable form as a VoltDB user-defined function (UDF). The utility currently supports only models exported as PMML. Support for other types of machine learning systems (H2O, for example) may be added in the future.

There are two steps to preparing your systems to use the Machine Learning utility:

  1. Download and install the utility.

  2. Prepare your database servers by copying support JAR files into the /lib folders where VoltDB is installed.

Once the utility is installed, using it is a two-step process:

  1. Use the voltml utility to compile the PMML model and create a JAR file containing the user-defined function (UDF).

  2. Use the sqlcmd utility to load and define the UDF.

The following sections describe both of these processes.

2. Getting Started

The VoltDB Machine Learning utility comes as a compressed tar file. Before you can use the utility, you need to download the kit, install it, and prepare the servers. The following sections describe these steps. Note that these preliminary steps only need to be completed once.

2.1. Downloading and Installing the Kit

The first step is to download and unpack the kit:

  1. Download the kit from the VoltDB website: https://downloads.voltactivedata.com/technologies/client/voltdb-ml-latest.tar.gz

  2. Unpack the kit using the tar utility.

For example, the following command unpacks the utility to a folder in your home directory:

$ tar -zxvf voltdb-ml-v1.0.tar.gz -C $HOME
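Before unpacking, it can help to confirm that the archive holds the /bin and /lib folders that the later steps rely on. The following is a minimal sketch, not part of the kit: the function name kit_has_dirs is hypothetical, and it assumes the archive contains a single top-level folder with bin/ and lib/ directly beneath it.

```shell
# Hypothetical helper: succeeds if the archive lists both a bin/ and a lib/
# folder one level below its top-level directory (an assumed layout).
kit_has_dirs() {
    archive=$1
    # -t lists the archive contents without extracting anything
    listing=$(tar -tzf "$archive") || return 1
    printf '%s\n' "$listing" | grep -q '^[^/]*/bin/' &&
        printf '%s\n' "$listing" | grep -q '^[^/]*/lib/'
}
```

For example, `kit_has_dirs voltdb-ml-v1.0.tar.gz && tar -zxvf voltdb-ml-v1.0.tar.gz -C $HOME` unpacks the kit only if both folders are present.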

You can run the voltml utility by itself directly from the installation folder, with or without a full VoltDB database installation. Alternatively, you can copy the files for the machine learning utility into an existing VoltDB installation. This option is useful if you install voltml on an existing VoltDB server because:

  • It eliminates the need to copy the support files separately to that server.

  • It allows you to use the voltml command anywhere on the system without any prefix, since it gets installed into the VoltDB /bin directory, which should already be in your PATH.

To install voltml into an existing VoltDB installation after unpacking the kit, simply copy the contents of the /bin and /lib folders into the /bin and /lib folders of the VoltDB installation. For example, if voltml is unpacked into ~/voltdb-ml-v1.0 and VoltDB is installed in ~/voltdb, you can add voltml to the VoltDB installation with the following shell commands:

$ cd ~/voltdb-ml-v1.0
$ cp -r ./lib ~/voltdb/
$ cp -r ./bin ~/voltdb/
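After copying the files, you may want to confirm that the shell now resolves the voltml command. The sketch below uses only the standard `command -v` builtin; the function name on_path is hypothetical and is not part of the kit.

```shell
# Hypothetical helper: report where a command resolves on the PATH, or fail.
on_path() {
    if location=$(command -v "$1"); then
        echo "$1 found at $location"
    else
        echo "$1 is not on the PATH" >&2
        return 1
    fi
}
```

For example, `on_path voltml && voltml -h` displays the help text only if the command was found.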

2.2. Preparing the Servers

The second step is to prepare the VoltDB servers that will be using the resulting user-defined function. To do this, copy the following files from the /lib directory where voltml is installed into the /lib directory where VoltDB is installed on all of the servers that will be participating in the VoltDB database cluster:

  • pmml-evaluator-1.4.1.jar

  • pmml-model-1.4.1.jar

  • guava-24.0-jre.jar

For example, if voltml is installed in ~/voltml on one node of a three-node VoltDB cluster (srv1, srv2, and srv3) where VoltDB is installed in ~/voltdb, the following commands prepare the servers:

$ cd ~/voltml/lib
$ cp * ~/voltdb/lib/
$ scp * srv2:voltdb/lib/
$ scp * srv3:voltdb/lib/

Note that you must copy the files to all of the VoltDB servers in the cluster before starting the database.
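For larger clusters, the same copy can be written as a loop over the server names. The following is a sketch under stated assumptions: the function name copy_support_jars and the DRY_RUN variable are hypothetical, the remote path voltdb/lib/ mirrors the example above, and the loop copies only the three JAR files listed in this section.

```shell
# Hypothetical helper: copy the three support JARs to each remote server.
# Usage: copy_support_jars <local-lib-dir> <host>...
# Set DRY_RUN=1 to print the scp commands instead of running them.
copy_support_jars() {
    src_lib=$1
    shift
    for host in "$@"; do
        for jar in pmml-evaluator-1.4.1.jar pmml-model-1.4.1.jar guava-24.0-jre.jar; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "scp $src_lib/$jar $host:voltdb/lib/"
            else
                # Stop at the first failed copy so no server is silently skipped
                scp "$src_lib/$jar" "$host:voltdb/lib/" || return 1
            fi
        done
    done
}
```

For example, `DRY_RUN=1 copy_support_jars ~/voltml/lib srv2 srv3` prints the six scp commands without contacting the servers.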

3. Loading PMML Models into VoltDB

The next step is to compile the model into a VoltDB UDF and load it into your database. To compile machine learning models using the voltml utility, you must have the Java JDK version 7 or 8 installed. To execute the resulting user-defined functions, the VoltDB database cluster must be running VoltDB version 7.6 or later.
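One way to check the JDK requirement before compiling is to inspect the version that javac reports. The sketch below is illustrative only; the function names are hypothetical. It relies on the fact that JDK 7 and 8 report their versions as 1.7.x and 1.8.x.

```shell
# Hypothetical helper: succeeds if a javac version string denotes JDK 7 or 8.
is_supported_jdk() {
    case $1 in
        1.7|1.7.*|1.8|1.8.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the version from "javac <version>".
# JDK 8 and earlier write this line to stderr, so both streams are merged.
installed_javac_version() {
    javac -version 2>&1 | awk '$1 == "javac" { print $2; exit }'
}
```

For example, `is_supported_jdk "$(installed_javac_version)" || echo "JDK 7 or 8 not found"` prints a warning when the installed compiler is missing or is a different release.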

3.1. Running the voltml Utility

To compile your PMML model, use the voltml command. If you installed the utility separately, you can change to the directory where it is installed and run the command from the /bin directory as bin/voltml. If you copied the installation files into an existing VoltDB installation, so the command is in your PATH, you can use the voltml command from anywhere on the system.

In either case, enter the command and specify the location of your model PMML file as an argument. For example:

$ cd ~/voltdb-ml-v1.0
$ bin/voltml mymodel.pmml

Or:

$ cd /dev/workspace
$ voltml mymodel.pmml

By default, the utility compiles the model and creates a JAR file, with the same name as the model file, containing the Java class for a user-defined function (UDF). The UDF uses the same name, "mymodel" in this example, for both the class and the method; that is, "Mymodel.mymodel".

If you want the UDF to have a different name, you can use the -n or --name flag to specify an alternate name. Note that class and method names are case sensitive: voltml starts by converting the name to all lower case and then capitalizes the first letter of the class name. For example, the following command creates a class and method named "Colormatch.colormatch":

$ voltml --name="colormatch" mymodel.pmml
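The naming rule described above can be illustrated with a small shell function. This is a sketch of the rule as stated in this section, not voltml's actual code; the function name udf_names is hypothetical, and how voltml treats characters that are not valid in Java identifiers is not covered here.

```shell
# Illustrative only: mimic the documented naming rule.
# Strip a .pmml extension, lower-case the name, then capitalize the first
# letter for the class. Prints "<Class>.<method>".
udf_names() {
    base=$(basename "$1" .pmml)
    lower=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
    first=$(printf '%s' "$lower" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$lower" | cut -c2-)
    printf '%s%s.%s\n' "$first" "$rest" "$lower"
}
```

Under this rule, `udf_names mymodel.pmml` prints Mymodel.mymodel, and a mixed-case name such as ColorMatch yields Colormatch.colormatch.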

You can also rename the output file using the -o or --outputjar flag. For example, the following command creates the UDF class as Mymodel, but creates the output JAR file as myudf.jar:

$ voltml --outputjar="myudf.jar" mymodel.pmml

Table 1, “Arguments to the voltml Command” summarizes the allowable arguments to the voltml command.

Table 1. Arguments to the voltml Command

Argument           Description
-h                 Displays help text explaining the usage and allowable arguments for the voltml command.
-n, --name         Specifies the name of the resulting UDF class and method. The default is to use the name of the input file.
-o, --outputjar    Specifies the name and, optionally, path of the output JAR file. The default is to create a file in the current working directory with the same name as the input file and a file extension of .jar.

3.2. Loading and Defining the UDF

Once you compile your model into a user-defined function (UDF), you are ready to load and define the UDF in the VoltDB database. Use the sqlcmd utility to perform this task using the LOAD CLASSES and CREATE FUNCTION statements that voltml displays when it completes its processing. You can just copy and paste the statements into sqlcmd:

$ sqlcmd
> LOAD CLASSES mymodel.jar;
> CREATE FUNCTION mymodel FROM METHOD Mymodel.mymodel;
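If you prefer to script this step rather than paste the statements interactively, sqlcmd also reads statements from standard input. The following sketch generates the two statements shown above; the function name udf_ddl is hypothetical, and the statements that voltml itself prints remain the authoritative version.

```shell
# Hypothetical helper: print the two statements needed to load and define a UDF.
# Usage: udf_ddl <jar-file> <function-name> <Class.method>
udf_ddl() {
    jar=$1
    func=$2
    method=$3
    printf 'LOAD CLASSES %s;\n' "$jar"
    printf 'CREATE FUNCTION %s FROM METHOD %s;\n' "$func" "$method"
}
```

For example, `udf_ddl mymodel.jar mymodel Mymodel.mymodel | sqlcmd` sends both statements to the database, assuming sqlcmd is in your PATH and can reach a running database.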

4. Applying the Model to Your Data

Once your UDF is defined, your database is ready to start processing data using the model evaluator. The voltml UDF works like any other function within SQL, where you specify the function name and the necessary arguments as a list within parentheses after the name.

The number, order, and type of the arguments for the UDF are the same as the inputs declared in the PMML definition. For example, if you have a model called lifespan that looks at the age, height, and weight of an individual, the following query evaluates the model against a selection of database entries:

SELECT state, AVG( lifespan(age,height,weight) )
  FROM members
  GROUP BY state ORDER BY state;
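To see which inputs a model declares before writing a query, you can list the fields in the PMML file. The sketch below is a rough text-based approach, not a PMML parser: the function name pmml_input_fields is hypothetical, it assumes each MiningField element sits on its own line, and it assumes the UDF arguments follow the order of the MiningSchema, which this document does not confirm.

```shell
# Hypothetical helper: print the names of MiningField elements that are not
# marked as target, predicted, or supplementary, in document order.
# Assumes one <MiningField .../> element per line.
pmml_input_fields() {
    grep '<MiningField ' "$1" |
        grep -v -E 'usageType="(target|predicted|supplementary)"' |
        sed -E 's/.*[[:space:]]name="([^"]*)".*/\1/'
}
```

For the lifespan model described above, running `pmml_input_fields` on its PMML file would be expected to list age, height, and weight, one per line, if the file follows the assumed layout.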