10.2. Writing a User-Defined Aggregate Function

Aggregate functions are more complex than scalar functions because the program code does not receive a predefined number of arguments, it must handle a variable number of rows. It must also be able to aggregate results from multiple partitions in the case of partitioned tables. To make this possible, user-defined aggregate functions are written as a Java class with four required methods:

start() — This method initializes the function and is called once when the function is first invoked.
assemble( arg ) — This method processes the argument to the function and is called once for each record matching the constraints defined by the query.
combine( class-instance ) — For partitioned tables in a multi-partition procedure, this method combines the results of one partition into the results of another and is called iteratively until the results from all of the partitions are aggregated.
end() — This method finalizes the function, returning the function result. It is called once after all of the input has been processed and, if appropriate, the partitions combined.

Let's take a look at a simple example of an aggregate function to understand how this works. The MAX() built-in function returns the maximum value from a set of inputs and CHAR_LENGTH() returns the length of a text string. So you can combine them to find the length of the longest string. But what if you want to know what is the longest word in the set of strings? For this you need to write your own aggregate function.

First, the function must import and implement the VoltUDAggregate interface. Within the class, the datatype of the assemble method's parameter defines the type of the function's argument. For example, if you declare the method as having one String parameter, the function will accept one argument, a VoltDB VARCHAR expression. Similarly, the datatype of the end method determines the datatype that the function returns. See the appendix on Datatype compatibility in the Using VoltDB manual for details on the mapping of Java and VoltDB datatypes.

So the Java source code of a longest word function might look like the following:

package fadvisor.functions;

import java.io.Serializable;
import org.voltdb.VoltUDAggregate;

public class LongestWord 
   implements Serializable, VoltUDAggregate<String, LongestWord> {

      private String longest;

      public void start() {
        /* Initialize value. */
        longest = "";
      }
      
      public void assemble (String txt) {
         /* Break into uppercase words, then check each word. */
         String[] words = txt.toUpperCase().split("\\W+");
         for (int i=0;i<words.length;i++) {
            if (words[i].length() > longest.length()) {
                longest = words[i];
            } else {
                /* If the same length, alphabetize. */
                if ( words[i].length() == longest.length() & 
                     longest.compareTo(words[i]) > 0 ) {
                     longest = words[i]; 
                }
            }
            
         }
      }
      
      public void combine(LongestWord other) {
         /* Merge the results from each partition. */
         if (other.longest.length() > longest.length()) {
            longest = other.longest;
         } else {
            if ( other.longest.length() == longest.length() & 
                 longest.compareTo(other.longest) > 0 ) {
                 longest = other.longest;
            }         
         }
      }
      
      public String end() {
         /* Return the result. */
        return longest;

      }
}

The bulk of the assemble and combine methods perform the actual task the function is designed for. The key point is to make sure the end method returns the appropriate datatype object that correlates to the VoltDB datatype you want the function to return in the SQL statement. In this example, the method is declared as returning a String, matching the VoltDB VARCHAR type.

Guide to Performance and Customization

10.2. Writing a User-Defined Aggregate Function