An important aspect of system sizing is planning for the memory required to support the application. Because VoltDB is an in-memory database, allocating sufficient memory is vital.
Section 3.3, “Sizing for Capacity” provides some simple equations for estimating the memory requirements of a prospective application. If you are already at the stage where the database schema is well-defined and want more precise measurements of potential memory usage, this chapter provides details about how memory gets allocated.
For VoltDB databases, there are three aspects of memory sizing that must be considered:

- Understanding how much memory is required to store the data itself; that is, the contents of the database
- Evaluating how that data is distributed across the cluster, based on the proportion of partitioned to replicated tables and the K-safety value
- Determining the memory requirements of the server process
The sum of the estimated data requirements per server and the Java heap size required by the server process provides the total memory requirement for each server.
To plan effectively for database capacity, you must know in advance both the structure of the data and the projected volume. This means you must have at least a preliminary database schema, including tables, columns, and indexes, as well as the expected number of rows for each table.
It is often useful to write this initial sizing information down, for example in a spreadsheet. Your planning may even allow for growth, assigning values for both the initial volume and projected long-term growth. For example, here is a simplified example of a spreadsheet for a database supporting a flight reservation system:
| Name | Type | Size | Initial Volume | Future Volume |
|------|------|------|----------------|---------------|
| - FlightByID | Hash Index | | 5,000 | 20,000 |
| - FlightByDepartTime | Tree Index | | 5,000 | 20,000 |
| - AirportByCode | Tree Index | | 10,000 | 10,000 |
| - ReservByFlight | Hash Index | | 100,000 | 200,000 |
| - CustomerByID | Hash Index | | 200,000 | 1,000,000 |
| - CustomerByName | Tree Index | | 200,000 | 1,000,000 |
Using the database schema, it is possible to calculate the size of the individual table records and indexes, which, when multiplied by the volume projections, gives a good estimate of the total memory needed to store the database contents. The following sections explain how to calculate the Size column for individual table rows and indexes.
The size of individual table rows depends on the number and datatype of the columns in the table. For fixed-size datatypes, such as INTEGER and TIMESTAMP, the column is stored inline in the table record using the specified number of bytes. Table 4.1, “Memory Requirements For Tables By Datatype” specifies the length (in bytes) of fixed size datatypes.
For variable length datatypes, such as VARCHAR and VARBINARY, how the data is stored and, consequently, how much space it requires, depends on both the actual length of the data and the maximum possible length. If the maximum length is less than 64 bytes, the data is stored inline in the tuple as fixed-length data consuming the maximum number of bytes plus one for the length. So, for example, a VARCHAR(32 BYTES) column takes up 33 bytes, no matter how long the actual data is.
Note that VARCHAR columns can be declared in characters (the default) or bytes. For storage calculations, variable-length strings declared in characters are assumed to consume 4 bytes for every character. In other words, a variable-length string declared as VARCHAR(8) in characters consumes the same amount of space as a string declared as VARCHAR(32 BYTES).
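As a quick check of the inline-storage rules, here is a minimal sketch (the helper name `inline_varchar_bytes` is hypothetical, not part of VoltDB):

```python
def inline_varchar_bytes(declared_length, in_characters=True):
    """Inline storage (in bytes) for a variable-length column whose
    maximum length is under 64 bytes: the maximum size plus 1 byte
    for the length. Columns declared in characters are budgeted at
    4 bytes per character."""
    max_bytes = declared_length * 4 if in_characters else declared_length
    if max_bytes >= 64:
        raise ValueError("64 bytes or more is stored in pooled memory, not inline")
    return max_bytes + 1

# VARCHAR(32 BYTES) always consumes 33 bytes, regardless of content.
print(inline_varchar_bytes(32, in_characters=False))  # 33
# VARCHAR(8) in characters is budgeted as 32 bytes, so also 33 bytes.
print(inline_varchar_bytes(8))  # 33
```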
If the maximum length is 64 bytes or more, the data is stored in pooled memory rather than inline. To do this, there is an 8-byte pointer stored inline in the tuple, a 24-byte string reference object, and the space required to store the data itself in the pool. Within the pool, the data is stored as a 4-byte length, an 8-byte reverse pointer to the string reference object, and the data.
To complicate the calculation somewhat, data stored in pooled memory is not stored as arbitrary lengths. Instead, data is incremented to the smallest appropriate "pool size", where pool sizes are powers of 2 and intermediary values. In other words, pool sizes include 2, 4, 6 (2+4), 8, 12 (8+4), 16, 24 (8+16), 32 and so on up to a maximum of 1 megabyte for data plus 12 bytes for the pointer and length. For example, if the LastName column in the Customer table is defined as VARCHAR(32) (that is, a maximum length of 128 bytes) and the actual content is 95 bytes, the column consumes 160 bytes:
  8   Inline pointer
 24   String reference object
  4   Data length
  8   Reverse pointer
 95   Data
107   Pool total, incremented to next pool size: 128
160   Total
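The same arithmetic can be sketched in code; `pool_size` and `pooled_column_bytes` are hypothetical helpers that assume the pool-size progression described above (powers of 2 plus the intermediate values):

```python
def pool_size(n):
    """Round n up to the smallest pool allocation size. Pool sizes are
    the powers of 2 and the values midway between them:
    2, 4, 6, 8, 12, 16, 24, 32, 48, ..."""
    candidates = [2, 4]
    power = 4
    while candidates[-1] < n:
        candidates.append(power + power // 2)  # 6, 12, 24, 48, ...
        power *= 2
        candidates.append(power)               # 8, 16, 32, 64, ...
    return next(c for c in candidates if c >= n)

def pooled_column_bytes(actual_length):
    """Total memory for a non-inlined variable-length column: an 8-byte
    inline pointer and a 24-byte string reference object, plus a pooled
    allocation holding a 4-byte length, an 8-byte reverse pointer, and
    the data itself, rounded up to the next pool size."""
    return 8 + 24 + pool_size(4 + 8 + actual_length)

# The worked example: 95 bytes of data -> 107 bytes in the pool,
# incremented to a 128-byte pool size, plus 32 bytes inline.
print(pooled_column_bytes(95))  # 160
```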
Note that if a variable length column is defined with a maximum length of 64 bytes or more, it is not stored inline, even if the actual content is less than 64 bytes. Variable length columns are stored inline only if the maximum length is less than 64 bytes.
Table 4.1, “Memory Requirements For Tables By Datatype” summarizes the memory requirements for each datatype.
Table 4.1. Memory Requirements For Tables By Datatype
| Datatype | Size (in bytes) | Notes |
|----------|-----------------|-------|
| VARCHAR (<64 bytes) | maximum size + 1 | Stored inline |
| VARBINARY (<64 bytes) | maximum size + 1 | Stored inline |
| VARCHAR (>=64 bytes) | 32 + (actual size + 12 + padding) | Pooled resource. Total size includes an 8-byte inline pointer and a 24-byte string reference object, plus the pooled resource itself. |
| VARBINARY (>=64 bytes) | 32 + (actual size + 12 + padding) | Same as VARCHAR. |
For tables whose variable length columns are all under 64 bytes, memory usage can be sized very accurately using the preceding table. However, for tables with variable length columns of 64 bytes or more, sizing is approximate at best. Besides the variability introduced by pooling, any sizing calculation must be based on an estimate of the average length of data in the variable columns.
For the safest and most conservative estimates, you can use the maximum length when calculating variable length columns. If, on the other hand, there are many variable length columns or you know the data will vary widely, you can use an estimated average or 90th percentile figure to avoid over-estimating memory consumption.
Indexes are sized in a way similar to tables, where the size and number of the index columns determine the size of the index. However, the calculations for tree and hash indexes are distinct.
For tree indexes, which are the default index type, you can calculate the size of the individual index entries by adding up the size of each column in the index plus 40 bytes for overhead (pointers and lengths). The sizes of the columns are identical to those used when sizing tables, as described in Table 4.1, “Memory Requirements For Tables By Datatype”, with the exception of non-inlined data. For variable length columns with a maximum length of 64 bytes or more, the index contains only an 8-byte pointer; the data itself is not replicated.
So, for example, the CustomerByName index on the Customer table, which is a tree index containing the VARCHAR(32) fields LastName and FirstName, has a length of 56 bytes for each entry:
 8   Pointer to LastName
 8   Pointer to FirstName
40   Overhead
56   Total
The following equation summarizes how to calculate the size of a tree index.
(sum-of-column-sizes + 8 + 32) * rowcount
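A minimal sketch of the tree-index equation (the function name is hypothetical):

```python
def tree_index_bytes(column_sizes, rowcount):
    """Tree index size: the sum of the index column sizes plus 40 bytes
    of overhead per entry, multiplied by the row count. Non-inlined
    columns contribute only an 8-byte pointer each."""
    return (sum(column_sizes) + 40) * rowcount

# CustomerByName: two non-inlined VARCHAR(32) columns (8-byte pointers
# each), over 1,000,000 rows.
print(tree_index_bytes([8, 8], 1_000_000))  # 56000000
```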
For hash indexes, the total size of the index is not a direct multiple of the individual index entries. Hash indexes are stored as a hash table, the size of which is determined both by the size of the content and by the number of entries. The size of an individual entry is the sum of the sizes of the index columns plus 32 bytes of overhead. (Hash indexes support integer columns only, so there are no variable length keys.) But in addition to the individual entries, the index requires space for the hash table itself, based on the total number of entries. This space is calculated (in bytes) as two times the number of rows, plus one, multiplied by eight.
The following equation summarizes how to calculate the size of a hash index.
(((2 * rowcount) + 1) * 8) + ((sum-of-column-sizes + 32) * rowcount)
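The hash-index equation as a sketch (again, a hypothetical helper name):

```python
def hash_index_bytes(column_sizes, rowcount):
    """Hash index size: the hash table itself at ((2 * rows) + 1) * 8
    bytes, plus each entry's column sizes and 32 bytes of overhead."""
    return ((2 * rowcount) + 1) * 8 + (sum(column_sizes) + 32) * rowcount

# FlightByID: a single 4-byte INTEGER key over 20,000 rows.
print(hash_index_bytes([4], 20_000))  # 1040008
```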
Note that hash indexes do not grow linearly, since incrementally allocating small blocks of memory is inefficient. Instead, when the index needs more space (because additional entries have been added), the index doubles in size all at once. Similarly, the index does not shrink immediately when entries are removed.
In general, the space allocated to hash indexes doubles in size whenever the index usage reaches 75% of capacity and the memory allocation is cut in half if usage drops below 15%.
Using the preceding formulas, it is possible to size the sample flight database mentioned earlier. For example, you can size the individual rows of the Flight table based on its columns and datatypes. The following table demonstrates the sizing of the Flight table.
| Column | Datatype | Size in Bytes |
|--------|----------|---------------|
The same calculations can be done for the other tables and indexes. When combined with the expected volumes (described in Section 4.1, “Planning for Database Capacity”), you get an estimate for the total memory required for storing the database content of approximately 500 megabytes, as shown in the following table.
| Name | Type | Size | Final Volume | Total Size |
|------|------|------|--------------|------------|
| - FlightByID | Hash Index | 36[*] | 20,000 | 1,040,008 |
| - FlightByDepartTime | Tree Index | 48 | 20,000 | 960,000 |
| - AirportByCode | Tree Index | 44 | 10,000 | 440,000 |
| - ReservByFlight | Hash Index | 36[*] | 200,000 | 10,400,008 |
| - CustomerByID | Hash Index | 36[*] | 1,000,000 | 52,000,008 |
| - CustomerByName | Tree Index | 56 | 1,000,000 | 56,000,000 |

[*] For hash indexes, Size specifies the individual entry size only. The total size includes the sum of the entries plus the hash table itself.
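Putting the two index formulas together reproduces the index totals above. The key sizes below are inferred from the per-entry sizes in the table (4-byte INTEGER keys, an 8-byte TIMESTAMP key for FlightByDepartTime, and two 8-byte pointers for the non-inlined CustomerByName columns); they are illustrative assumptions, not part of the original schema listing:

```python
def tree_index_bytes(column_sizes, rowcount):
    # (sum of column sizes + 40 bytes of overhead) per entry
    return (sum(column_sizes) + 40) * rowcount

def hash_index_bytes(column_sizes, rowcount):
    # hash table space plus (column sizes + 32 bytes of overhead) per entry
    return ((2 * rowcount) + 1) * 8 + (sum(column_sizes) + 32) * rowcount

indexes = [
    ("FlightByID",         hash_index_bytes([4], 20_000)),
    ("FlightByDepartTime", tree_index_bytes([8], 20_000)),
    ("AirportByCode",      tree_index_bytes([4], 10_000)),
    ("ReservByFlight",     hash_index_bytes([4], 200_000)),
    ("CustomerByID",       hash_index_bytes([4], 1_000_000)),
    ("CustomerByName",     tree_index_bytes([8, 8], 1_000_000)),
]
for name, size in indexes:
    print(f"{name:<20}{size:>12,}")
```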