Chapter 8. Maintaining and Repairing the Cluster


The VoltDB Enterprise Manager helps you monitor and maintain your database by displaying statistics about the state of the database. In addition to the graphs described in Chapter 6, Monitoring the Cluster, the dashboard provides access to log messages of any errors or warnings reported by the individual servers or the cluster as a whole.

This chapter explains how to:

  • Detect and evaluate error conditions

  • Remove and rejoin individual nodes for repair

  • Perform regular maintenance tasks for the database using snapshots

  • Report problems

8.1. Detecting and Evaluating Error Conditions

Performance issues can be diagnosed using the graphs and data tables that the dashboard displays. However, an error may be hard to identify from the performance graphs alone. In that case, review the errors and warnings that the database writes to its logs.

At a high level, the dashboard provides a visual indicator of serious problems through the icons next to the database and server names. If an icon changes from green to yellow or red, you know there is a problem. For example, if a server fails on a cluster running with K-safety, the server icon turns red and the cluster icon changes to yellow, showing that the cluster is still running, but not with its full complement of nodes.
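The relationship between server state and the cluster icon can be summarized in a short Python sketch. This is only an illustration of the behavior described above, not Enterprise Manager code: the names `Icon` and `cluster_icon` are hypothetical, and the rule that a cluster that is no longer running shows red is an assumption, since the text states only that yellow or red indicates a problem.

```python
from enum import Enum
from typing import Sequence


class Icon(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def cluster_icon(servers_running: Sequence[bool], cluster_running: bool) -> Icon:
    """Return an illustrative cluster icon color.

    servers_running holds one flag per server (True = running).
    cluster_running says whether the database as a whole is still up,
    for example because K-safety covered the failed servers.
    """
    if not cluster_running:
        # Assumption: a cluster that has stopped is shown as red.
        return Icon.RED
    if all(servers_running):
        # Every server is up: no problem to report.
        return Icon.GREEN
    # Still running, but without its full complement of nodes.
    return Icon.YELLOW
```

For a three-node K-safe cluster that loses one server, `cluster_icon([True, False, True], True)` returns `Icon.YELLOW`, matching the example in the paragraph above.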

The Enterprise Manager helps you find out exactly what happened by displaying the console logs from all the nodes in the cluster.

  • Click on the name of the database in the list of databases and choose View from the popup menu to display that database in the dashboard.

  • If the log message area in the lower right of the dashboard is collapsed, click on the Logs heading or the triangle next to it to expand the display and show the log messages.

In the following example, the server Madison suffered a hardware failure and stopped. Since the VoltDB process stopped abruptly, there are no messages from the node itself. But if you look at the log messages, you see an error indicating that the other node sees the failure and identifies the troubled node. At the same time, you can see the icon for Madison has turned gray, indicating it has stopped.
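If you also have console log text available outside the dashboard, the same kind of triage (looking only at errors and warnings) can be sketched in Python. This is a minimal sketch under stated assumptions: it assumes each log line contains a severity keyword such as WARN or ERROR somewhere in the line, which may not match the exact layout of your server logs, and the function name `problem_lines` is hypothetical.

```python
import re

# Assumption: severity appears as a standalone keyword in each log line.
SEVERITY = re.compile(r"\b(WARN(?:ING)?|ERROR|FATAL)\b")


def problem_lines(lines):
    """Return (severity, line) pairs for lines that report a problem."""
    hits = []
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            # Keep the first severity keyword found and the full line.
            hits.append((match.group(1), line.rstrip("\n")))
    return hits
```

Calling `problem_lines` on the lines of a saved log returns only the warning and error entries, in their original order, so a report from a surviving node about a failed peer is not lost among informational messages.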