Aggregate functions are more complex than scalar functions because the program code does not receive a predefined number of arguments, it must handle a variable number of rows. It must also be able to aggregate results from multiple partitions in the case of partitioned tables. To make this possible, user-defined aggregate functions are written as a Java class with four required methods:
start()
— This method initializes the function and is called once when the function is
first invoked.
assemble( arg )
— This method processes the argument to the function and is called once
for each record matching the constraints defined by the query.
combine( class-instance )
— For partitioned tables in a multi-partition procedure, this
method combines the results of one partition into the results of another and is called iteratively until the results
from all of the partitions are aggregated.
end()
— This method finalizes the function, returning the function result. It is called
once after all of the input has been processed and, if appropriate, the partitions combined.
Let's take a look at a simple example of an aggregate function to understand how this works. The MAX() built-in function returns the maximum value from a set of inputs and CHAR_LENGTH() returns the length of a text string. So you can combine them to find the length of the longest string. But what if you want to know what is the longest word in the set of strings? For this you need to write your own aggregate function.
First, the function must import and implement the VoltUDAggregate
interface. Within the class, the
datatype of the assemble
method's parameter defines the type of the function's argument. For example, if
you declare the method as having one String parameter, the function will accept one argument, a VoltDB VARCHAR expression.
Similarly, the datatype of the end
method determines the datatype that the function returns. See the
appendix on Datatype compatibility in the
Using VoltDB manual for details on the
mapping of Java and VoltDB datatypes.
So the Java source code of a longest word function might look like the following:
package fadvisor.functions; import java.io.Serializable; import org.voltdb.VoltUDAggregate; public class LongestWord implements Serializable, VoltUDAggregate<String, LongestWord> { private String longest; public void start() { /* Initialize value. */ longest = ""; } public void assemble (String txt) { /* Break into uppercase words, then check each word. */ String[] words = txt.toUpperCase().split("\\W+"); for (int i=0;i<words.length;i++) { if (words[i].length() > longest.length()) { longest = words[i]; } else { /* If the same length, alphabetize. */ if ( words[i].length() == longest.length() & longest.compareTo(words[i]) > 0 ) { longest = words[i]; } } } } public void combine(LongestWord other) { /* Merge the results from each partition. */ if (other.longest.length() > longest.length()) { longest = other.longest; } else { if ( other.longest.length() == longest.length() & longest.compareTo(other.longest) > 0 ) { longest = other.longest; } } } public String end() { /* Return the result. */ return longest; } }
The bulk of the assemble
and combine
methods perform the actual task the
function is designed for. The key point is to make sure the end method returns the appropriate datatype object that
correlates to the VoltDB datatype you want the function to return in the SQL statement. In this example, the method is
declared as returning a String
, matching the VoltDB VARCHAR type.