Snapshot activity, whether for automated, manual, command logging, or other operational snapshots, involves both processing and disk I/O. The snapshot as a whole is broken up into smaller snapshot tasks, each of which writes a small part of the snapshot to disk. These tasks are interspersed among user transactions in the transaction queue. Because snapshots and user transactions share the same queue, snapshots can have a noticeable impact on throughput, latency, or both on a very busy database.
However, you can adjust the overall impact of snapshots by controlling the frequency and size of the individual snapshot tasks. When you reduce the frequency or size of the snapshot tasks, the snapshot takes longer but has less impact on the latency of user transactions. There are three ways to manage snapshot activity:
Snapshot Priority — Snapshot-specific priority is a simple control that
increases or decreases the frequency with which snapshot tasks are added to the queue. You control the snapshot priority
by setting the deployment.systemsettings.snapshot.priority property as an integer value. The larger the value, the longer
the interval between snapshot tasks.
Queue Priority — You can assign snapshots a priority for queueing the same
way you can for other user and operational tasks. This means you set the priority relative to other activities such as
XDCR, export, or user tasks. You can even set the priority for individual user transactions using the Client2 API. You
set the queue priority for snapshots by assigning the deployment.systemsettings.priorities.snapshot.priority property an
integer value between 1 and 8. Again, the higher the number, the lower the priority. Note that the snapshot-specific
priority and queue priority are mutually exclusive: if queue priorities are enabled, snapshot-specific prioritization is
disabled.
Autotuning — Finally, you can choose to let the system select the best
option for the frequency and size of snapshot tasks by enabling snapshot autotuning. Rather than using fixed values,
autotuning uses the current workload, measured as the size of the transaction queue, to decide how frequently to run
snapshot tasks and how large to make them. The busier the queue, the less frequently snapshot tasks run and the smaller
the tasks are, resulting in slower snapshots but less impact on latency. On the other hand, because the adjustments are
sensitive to the current queue size, when the workload is low the snapshot tasks can be larger and run more frequently,
so snapshot performance is not negatively affected during lulls in the database workload. You enable autotuning by
setting the deployment.systemsettings.snapshot.autotune.enabled property to true, which can be done either when
configuring the database or on the fly while the database is running.
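The rules stated for the three controls can be checked mechanically before a configuration is applied. The following is a minimal sketch, not part of the product: it assumes the settings are available as a flat map of dotted property names, and the key used to represent "queue priorities are enabled" is a hypothetical stand-in, because the text does not name that switch.

```python
from typing import List, Mapping

# Property names taken from the text above.
SNAPSHOT_PRIORITY = "deployment.systemsettings.snapshot.priority"
QUEUE_PRIORITY = "deployment.systemsettings.priorities.snapshot.priority"
AUTOTUNE = "deployment.systemsettings.snapshot.autotune.enabled"
# ASSUMPTION: hypothetical key standing in for whatever switch enables
# queue priorities; the text does not give its name.
QUEUE_PRIORITIES_ENABLED = "deployment.systemsettings.priorities.enabled"


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_snapshot_settings(props: Mapping[str, object]) -> List[str]:
    """Return notes about settings that break the rules described above."""
    notes: List[str] = []

    # The snapshot-specific priority is an integer; no range is stated.
    if SNAPSHOT_PRIORITY in props and not _is_int(props[SNAPSHOT_PRIORITY]):
        notes.append(f"{SNAPSHOT_PRIORITY} must be an integer")

    # The queue priority for snapshots is an integer between 1 and 8.
    if QUEUE_PRIORITY in props:
        value = props[QUEUE_PRIORITY]
        if not _is_int(value) or not 1 <= value <= 8:
            notes.append(f"{QUEUE_PRIORITY} must be an integer between 1 and 8")

    # Autotuning is a simple on/off switch.
    if AUTOTUNE in props and not isinstance(props[AUTOTUNE], bool):
        notes.append(f"{AUTOTUNE} must be true or false")

    # The two priorities are mutually exclusive: queue priorities win.
    if props.get(QUEUE_PRIORITIES_ENABLED) is True and SNAPSHOT_PRIORITY in props:
        notes.append(
            f"{SNAPSHOT_PRIORITY} is ignored while queue priorities are enabled"
        )

    return notes
```

An empty list means none of the stated rules are violated. The function only reports problems and does not change the settings.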
If snapshots are impacting the latency of business transactions, you can try turning on autotuning to adjust the snapshot processing to match the available transaction capacity. The best way to determine whether autotuning is effective is to use transaction performance statistics to compare the default behavior, with autotuning disabled, against the behavior after autotuning is enabled.
1. First, establish baseline performance statistics for the default snapshot behavior. One method of measuring
   transaction performance is to collect raw statistics for the 99th percentile and maximum latency of transactions before,
   during, and after a snapshot, which can be done by calling the @Statistics system procedure with the LATENCY selector
   and examining the columns labeled P99 and MAX. This information is also available in the metrics properties
   voltdb_initiator_procedure_invoked_time_seconds_bucket and voltdb_initiator_procedure_invoked_time_seconds_max.
2. Update the database configuration to enable snapshot autotuning by changing the
   deployment.systemsettings.snapshot.autotune.enabled property to true.
3. After enabling autotuning, capture the same metrics as in step #1 and compare the before and after results.
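The before-and-after comparison in these steps can be reduced to a small calculation once the samples are collected. The sketch below assumes each sample has already been fetched and reduced to a mapping with the P99 and MAX columns named above; how the rows are retrieved from @Statistics is left out, and the function names are illustrative only.

```python
from typing import Dict, Iterable, Mapping

COLUMNS = ("P99", "MAX")  # column names from the LATENCY selector


def worst_latency(rows: Iterable[Mapping[str, float]]) -> Dict[str, float]:
    """Return the highest P99 and MAX values seen across a set of samples."""
    samples = list(rows)
    if not samples:
        raise ValueError("at least one sample is required")
    return {col: max(row[col] for row in samples) for col in COLUMNS}


def compare_runs(
    baseline: Iterable[Mapping[str, float]],
    autotuned: Iterable[Mapping[str, float]],
) -> Dict[str, float]:
    """Percent change in worst-case P99 and MAX from baseline to autotuned.

    Negative values mean latency dropped after autotuning was enabled.
    A baseline value of zero is not handled and raises ZeroDivisionError.
    """
    before = worst_latency(baseline)
    after = worst_latency(autotuned)
    return {
        col: 100.0 * (after[col] - before[col]) / before[col]
        for col in COLUMNS
    }


if __name__ == "__main__":
    # Made-up sample values, purely to show the call shape.
    baseline_rows = [{"P99": 400, "MAX": 1000}, {"P99": 800, "MAX": 2000}]
    autotuned_rows = [{"P99": 300, "MAX": 900}, {"P99": 600, "MAX": 1500}]
    print(compare_runs(baseline_rows, autotuned_rows))
```

Comparing the worst value in each run is one possible choice. It emphasizes the latency spikes a snapshot causes rather than the average behavior.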
To measure the effect on snapshot duration, repeat the preceding steps using the SNAPSHOTSUMMARY selector of the
@Statistics system procedure and compare the DURATION column (or the voltdb_snapshot_summary_info metric) to determine
how long snapshots are taking.
Finally, you can use Prometheus and Grafana to graph the metrics from step #1 to visualize the change in latency over time.
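A metric whose name ends in _bucket is a Prometheus histogram, so the 99th percentile is not stored directly. It is estimated from the cumulative bucket counts. The sketch below shows the linear interpolation idea behind that estimate, in the spirit of the PromQL histogram_quantile function. It is a simplified illustration and not the Prometheus implementation, and it assumes positive bucket bounds, as is the case for latencies.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets follow the Prometheus histogram convention: counts are
    cumulative and the last bucket has an upper bound of +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least two buckets, ending with +Inf")

    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations, so no estimate

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    if index == len(ordered) - 1:
        # Target falls in the +Inf bucket: report the highest finite bound.
        return ordered[-2][0]

    upper, count = ordered[index]
    lower, below = (0.0, 0.0) if index == 0 else ordered[index - 1]
    if count == below:
        return lower  # empty bucket (only reachable when q is 0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

Because the estimate depends on the bucket boundaries, a P99 value read from a graph may differ slightly from the P99 column returned by @Statistics.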