Under certain conditions, the use of TCP segmentation offload (TSO) and generic receive offload (GRO) can cause nodes to randomly drop out of a cluster. These settings let the system batch network packets, which introduces unnecessary latency and interferes with the necessary communication between VoltDB cluster nodes. The symptom of this problem is that nodes time out (that is, the rest of the cluster thinks they have failed) even though the nodes are still running and no other network issue, such as a network partition, is the cause.
Disabling TSO and GRO is recommended for any VoltDB cluster that experiences such instability. The commands to disable offloading are the following, where N is replaced by the number of the Ethernet card:
ethtool -K ethN tso off
ethtool -K ethN gro off
Note that these commands disable offloading temporarily. You must issue these commands every time the node reboots or, preferably, put them in a startup configuration file.
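As a rough illustration of what a startup script might contain, the following sketch wraps the two commands in a function. The function name disable_offload and the DRY_RUN variable are hypothetical and used only for illustration. Where the script is installed so that it runs at boot depends on the distribution and is not shown here.

```shell
#!/bin/sh
# Sketch only: disable TSO and GRO on a single interface.
# disable_offload and DRY_RUN are hypothetical names, not part of VoltDB or ethtool.
disable_offload() {
    iface="$1"
    for feature in tso gro; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: print the command instead of executing it.
            echo "ethtool -K $iface $feature off"
        else
            # Requires root privileges; stop at the first failure.
            ethtool -K "$iface" "$feature" off || return 1
        fi
    done
}
```

Calling disable_offload eth0 as root applies both settings to eth0. Running ethtool -k eth0 (lowercase k) afterwards lists the current offload settings, so you can confirm that tcp-segmentation-offload and generic-receive-offload are reported as off.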
It is also a good idea to check that TCP_RETRIES2 has not been altered. Setting TCP_RETRIES2 too low (below 8) can cause similar unpredictable timeouts. See the description of the VoltDB heartbeat timeout setting in Section A.3.7, “Heartbeat” for details.
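On Linux, this setting is exposed as the kernel parameter net.ipv4.tcp_retries2. One way to check it is sketched below, assuming a Linux system. The helper name check_retries is hypothetical, and the threshold of 8 is taken from the guidance above.

```shell
#!/bin/sh
# Sketch only: compare a tcp_retries2 value against the minimum of 8.
# check_retries is a hypothetical helper name used for illustration.
check_retries() {
    value="$1"
    if [ "$value" -lt 8 ]; then
        echo "WARNING: tcp_retries2=$value is below 8"
        return 1
    fi
    echo "OK: tcp_retries2=$value"
}

# Example call on a Linux host (left commented out so the sketch has no side effects):
# check_retries "$(sysctl -n net.ipv4.tcp_retries2)"
```

The helper takes the value as an argument rather than reading it directly, so the same check can be applied to values collected from several nodes.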