13.3. Tuning the Snapshot Process


Snapshot activity — for automated as well as manual, command logging, and other operational snapshots — involves both processing and disk I/O. The snapshot as a whole is broken up into smaller snapshot tasks, each writing a small part of the snapshot to disk. These tasks are interspersed among user transactions in the transaction queue. Since both snapshots and user transactions use the same queue, snapshots can have a noticeable impact on performance (in terms of throughput and/or latency) on a very busy database.

However, you can adjust the overall impact of snapshots by controlling the frequency and size of the individual snapshot tasks. Reducing the frequency or size of the snapshot tasks makes the snapshot take longer but lessens its impact on the latency of user transactions. There are three ways to manage snapshot activity:

  • Snapshot Priority — Snapshot-specific priority is a simple control that increases or decreases the frequency with which snapshot tasks are added to the queue. You control the snapshot priority by setting the deployment.systemsettings.snapshot.priority property as an integer value. The larger the value, the longer the interval between snapshot tasks.

  • Queue Priority — You can assign snapshots a queue priority the same way you can for other user and operational tasks, which sets their priority relative to other activities such as XDCR, export, or user tasks. (You can even set the priority for individual user transactions using the Client2 API.) You set the queue priority for snapshots by assigning the deployment.systemsettings.priorities.snapshot.priority property an integer value between 1 and 8. Again, the higher the number, the lower the priority. Note that the snapshot-specific priority and the queue priority are mutually exclusive: if queue priorities are enabled, snapshot-specific prioritization is disabled.

  • Autotuning — Finally, you can let the system choose the frequency and size of snapshot tasks by enabling snapshot autotuning. Rather than using fixed values, autotuning uses the current workload, measured as the size of the transaction queue, to decide how frequently to run snapshot tasks and how large to make them. The busier the queue, the less frequently snapshot tasks run and the smaller they are, resulting in slower snapshots but less impact on latency. Conversely, because the adjustments track the current queue size, snapshot tasks can be larger and run more frequently when the workload is low, so snapshot performance is not negatively affected during lulls in the database workload. You enable autotuning by setting the deployment.systemsettings.snapshot.autotune.enabled property to true, either when configuring the database or on the fly while the database is running.
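
The three controls above are named as dotted property paths. As a minimal sketch (not VoltDB tooling), the following Python expands such paths into the nested structure they describe and checks the one range this section states, 1 to 8 for the queue priority. The helper names expand and check_queue_priority and all sample values are hypothetical; only the property names come from the text above.

```python
import json


def expand(props):
    """Expand dotted property names into a nested dict."""
    root = {}
    for dotted, value in props.items():
        *parents, leaf = dotted.split(".")
        node = root
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


def check_queue_priority(value):
    """Accept only an integer from 1 to 8, the range given for queue priority."""
    if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 8:
        raise ValueError(f"queue priority must be an integer 1..8, got {value!r}")
    return value


# One entry per control; the values are placeholders for illustration only.
options = {
    "snapshot priority": {
        "deployment.systemsettings.snapshot.priority": 6,
    },
    "queue priority": {
        "deployment.systemsettings.priorities.snapshot.priority": check_queue_priority(7),
    },
    "autotuning": {
        "deployment.systemsettings.snapshot.autotune.enabled": True,
    },
}

for name, props in options.items():
    print(name)
    print(json.dumps(expand(props), indent=2))
```

The options are expanded separately rather than merged because the snapshot-specific priority and the queue priority are described above as mutually exclusive.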

If snapshots are impacting the latency of business transactions, you can try turning on autotuning to adjust the snapshot processing to match the available transaction capacity. The best way to determine whether autotuning is effective is to compare transaction performance statistics for the default behavior (autotuning disabled) against the same statistics after autotuning is enabled.

  1. First, establish baseline performance statistics for the default snapshot behavior. One method of measuring transaction performance is to collect raw statistics for the 99th percentile and maximum latency of transactions before, during, and after a snapshot, which can be done by calling the @Statistics system procedure with the LATENCY selector and examining the columns labeled P99 and MAX. This information is also available in the metrics properties voltdb_initiator_procedure_invoked_time_seconds_bucket and voltdb_initiator_procedure_invoked_time_seconds_max.

  2. Update the database configuration to enable snapshot autotuning by changing the deployment.systemsettings.snapshot.autotune.enabled property to true.

  3. After enabling autotuning, capture the same metrics as in step #1 and compare the before and after results.

  4. To measure the effect on snapshot duration, repeat the preceding steps using the @Statistics system procedure with the SNAPSHOTSUMMARY selector and compare the DURATION column (or the voltdb_snapshot_summary_info metric) to determine how long snapshots are taking.
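
The before-and-after comparison in the steps above can be sketched as follows. This is an illustrative Python example, not part of VoltDB: it assumes you have already collected the LATENCY rows yourself and reduced each one to a dict holding the P99 and MAX columns. The function names and the sample numbers are hypothetical, and the values are in whatever unit the statistics report.

```python
from statistics import mean


def summarize(samples):
    """Reduce collected rows to the mean P99 and the worst MAX."""
    return {
        "mean_p99": mean(row["P99"] for row in samples),
        "worst_max": max(row["MAX"] for row in samples),
    }


def compare(baseline, tuned):
    """Percent change from baseline to tuned; negative means lower latency."""
    before, after = summarize(baseline), summarize(tuned)
    return {key: 100.0 * (after[key] - before[key]) / before[key] for key in before}


# Made-up sample rows, standing in for statistics captured during a snapshot.
baseline = [{"P99": 400, "MAX": 2000}, {"P99": 600, "MAX": 3000}]
tuned = [{"P99": 300, "MAX": 1500}, {"P99": 500, "MAX": 1500}]

for key, change in compare(baseline, tuned).items():
    print(f"{key}: {change:+.1f}%")
```

The same percent-change calculation can be applied to the DURATION values from step 4, where a positive change (longer snapshots) is the expected cost of lower transaction latency.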

Finally, you can use Prometheus and Grafana to graph the metrics from step #1 and visualize the change in latency over time.
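
When graphing, the 99th percentile has to be derived from the bucket metric. The sketch below assumes that voltdb_initiator_procedure_invoked_time_seconds_bucket follows the standard Prometheus histogram convention (cumulative counts per upper bound, ending in +Inf), which its _bucket suffix suggests but this section does not state. Under that assumption, PromQL's histogram_quantile function estimates a quantile by linear interpolation within the matching bucket; the Python below is a simplified illustration of that idea with made-up bucket values, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # In the +Inf bucket (or an empty one) there is nothing to
            # interpolate, so fall back to the previous upper bound.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts; upper bounds are in seconds.
sample = [(0.001, 50), (0.005, 90), (0.01, 98), (0.05, 100), (math.inf, 100)]
print(histogram_quantile(0.99, sample))
```

The estimate is only as fine-grained as the bucket boundaries, so small shifts in P99 within a single bucket may not be visible on a graph.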