Configuring Terracotta Clusters For High Availability

Introduction

High Availability (HA) is an implementation designed to maintain uptime and access to services even during component overloads and failures. Terracotta clusters offer simple and scalable HA implementations based on the Terracotta Server Array (see Terracotta Server Arrays for more information).

The main features of a Terracotta HA architecture include:

  • Instant failover using a hot standby or multiple active servers – provides continuous uptime and services
  • Configurable automatic internode monitoring – Terracotta HealthChecker
  • Automatic permanent storage of all current shared (in-memory) data – available to all server instances (no loss of application state)
  • Automatic reconnection of temporarily disconnected server instances and clients – restores hot standbys without operator intervention, allows "lost" clients to reconnect

    Client reconnection refers to reconnecting clients that have not yet been disconnected from the cluster by the Terracotta Server Array. To learn about reconnecting Enterprise Ehcache clients that have been disconnected from their cluster, see Using Rejoin to Automatically Reconnect Terracotta Clients.

TIP: Nomenclature
This document may refer to a Terracotta server instance as L2, and a Terracotta client (the node running your application) as L1. These are the shorthand references used in Terracotta configuration files.

It is important to thoroughly test any High Availability setup before going to production. Suggestions for testing High Availability configurations are provided in this document.

Common Causes of Failures in a Cluster

Failures in a cluster include L1s being ejected, standby L2s becoming active and attempting to take over the cluster while the original active L2 is still functional, long pauses in cluster operations, and even complete cluster failure.

The most common causes of failures in a cluster are interruptions in the network and long Java GC cycles on particular nodes. Tuning the HealthChecker and reconnect features can reduce or eliminate these two problems. However, additional actions should also be considered.

Sporadic disruptions in network connections between L2s and between L2s and L1s can be difficult to track down. Be sure to thoroughly test all network segments connecting the nodes in a cluster, and also test network hardware. Check for speed, noise, reliability, and other applications that grab bandwidth.

Long GC cycles can also be helped by the following:

  • Tuning Java GC to work more efficiently with the clustered application.
  • Refactoring clustered applications that unnecessarily create too much garbage.
  • Ensuring that the problem node has enough memory allocated to the heap.

Other sources of failures in a cluster are disks that are nearly full or are running slowly, and running other applications that compete for a node’s resources.

Using BigMemory to Alleviate GC Slowdowns

Terracotta BigMemory allows L2s to use memory outside of the Java object heap. This also helps prevent another common cause of slowdowns: swapping. Swapping occurs when the system must use disk space to accomodate a data set that is too large for the heap. See Improving Server Performance With BigMemory.

HealthChecker can also be tuned to allow for GC or network interruptions.

Basic High-Availability Configuration

A basic high-availability configuration has the following components:

  • Two or More Terracotta Server Instances

    See Terracotta Server Arrays on how to set up a cluster with multiple Terracotta server instances.

  • Active-Passive Mode

    The <ha> section in the Terracotta configuration file should indicate the mode as networked-active-passive to allow for an active server instance and one or more "hot standby" (backup) server instances. The <networked-active-passive> subsection has a configurable parameter called <election-time> whose value is given in seconds. <election-time>, which sets the duration for elections to elect an active server, is a factor in network latency and server load. The default value is 5 seconds:

    <?xml version="1.0" encoding="UTF-8" ?>
    <tc:tc-config xmlns:tc="http://www.terracotta.org/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
     <servers>
    ...
       <ha>
         <mode>networked-active-passive</mode>
         <networked-active-passive>
           <election-time>5</election-time>
         </networked-active-passive>
       </ha>
     </servers>
     ...
    </tc:tc-config>
    

NOTE: Servers Should Not Share Data Directories
When using networked-active-passive mode, Terracotta server instances must not share data directories. Each server's <data> element should point to a different and preferably local data directory.
  • Server-Server Reconnection

    A reconnection mechanism can be enabled to restore lost connections between active and passive Terracotta server instances. See Automatic Server Instance Reconnect for more information.

  • Server-Client Reconnection

    A reconnection mechanism can be enabled to restore lost connections between Terracotta clients and server instances. See Automatic Client Reconnect for more information.

For more information on Terracotta configuration files, see:

High-Availability Features

The following high-availability features can be used to extend the reliability of a Terracotta cluster. These features are controlled using properties set with the <tc-properties> section at the beginning of the Terracotta configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- All content copyright Terracotta, Inc., unless otherwise indicated. All rights reserved. -->
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">

 <tc-properties>
   <property name="some.property.name" value="true"/>
   <property name="some.other.property.name" value="true"/>
   <property name="still.another.property.name" value="1024"/>
 </tc-properties>

<!-- The rest of the Terracotta configuration goes here. -->

</tc:tc-config>

See the section Overriding tc.properties in Configuration Guide and Reference for more information.

HealthChecker

HealthChecker is a connection monitor similar to TCP keep-alive. HealthChecker functions between Terracotta server instances (in High Availability environments), and between Terracotta sever instances and clients. Using HealthChecker, Terracotta nodes can determine if peer nodes are reachable, up, or in a GC operation. If a peer node is unreachable or down, a Terracotta node using HealthChecker can take corrective action. HealthChecker is on by default.

You configure HealthChecker using certain Terracotta properties, which are grouped into three different categories:

  • Terracotta server instance -> Terracotta client
  • Terracotta Server -> Terracotta Server (HA setup only)
  • Terracotta Client -> Terracotta Server

Property category is indicated by the prefix:

  • l2.healthcheck.l1 indicates L2 -> L1
  • l2.healthcheck.l2 indicates L2 -> L2
  • l1.healthcheck.l2 indicates L1 -> L2

For example, the l2.healthcheck.l2.ping.enabled property applies to L2 -> L2.

The following HealthChecker properties can be set in the <tc-properties> section of the Terracotta configuration file:

Property Definition
l2.healthcheck.l1.ping.enabled

l2.healthcheck.l2.ping.enabled

l1.healthcheck.l2.ping.enabled

Enables (True) or disables (False) ping probes (tests). Ping probes are high-level attempts to gauge the ability of a remote node to respond to requests and is useful for determining if temporary inactivity or problems are responsible for the node’s silence. Ping probes may fail due to long GC cycles on the remote node.
l2.healthcheck.l1.ping.idletime

l2.healthcheck.l2.ping.idletime

l1.healthcheck.l2.ping.idletime

The maximum time (in milliseconds) that a node can be silent (have no network traffic) before HealthChecker begins a ping probe to determine if the node is alive.
l2.healthcheck.l1.ping.interval

l2.healthcheck.l2.ping.interval

l1.healthcheck.l2.ping.interval

If no response is received to a ping probe, the time (in milliseconds) that HealthChecker waits between retries.
l2.healthcheck.l1.ping.probes

l2.healthcheck.l2.ping.probes

l1.healthcheck.l2.ping.probes

If no response is received to a ping probe, the maximum number (integer) of retries HealthChecker can attempt.
l2.healthcheck.l1.socketConnect

l2.healthcheck.l2.socketConnect

l1.healthcheck.l2.socketConnect

Enables (True) or disables (False) socket-connection tests. This is a low-level connection that determines if the remote node is reachable and can access the network. Socket connections are not affected by GC cycles.
l2.healthcheck.l1.socketConnectTimeout

l2.healthcheck.l2.socketConnectTimeout

l1.healthcheck.l2.socketConnectTimeout

A multiplier (integer) to determine the maximum amount of time that a remote node has to respond before HealthChecker concludes that the node is dead (regardless of previous successful socket connections). The time is determined by multiplying the value in ping.interval by this value.
l2.healthcheck.l1.socketConnectCount

l2.healthcheck.l2.socketConnectCount

l1.healthcheck.l2.socketConnectCount

The maximum number (integer) of successful socket connections that can be made without a successful ping probe. If this limit is exceeded, HealthChecker concludes that the target node is dead.
l1.healthcheck.l2.bindAddress Binds the client to the configured IP address. This is useful where a host has more than one IP address available for a client to use. The default value of "0.0.0.0" allows the system to assign an IP address.
l1.healthcheck.l2.bindPort Set the client’s callback port. Terracotta configuration does not assign clients a port for listening to cluster communications such as that required by HealthChecker. The default value of "0" allows the system to assign a port. A value of "-1" disables a client’s callback port.

The following diagram illustrates how HealthChecker functions.

Terracotta HealthChecker flow diagram.

Calculating HealthChecker Maximum

The following formula can help you compute the maximum time it will take HealthChecker to discover failed or disconnected remote nodes:

Max Time = (ping.idletime) + socketConnectCount * [(ping.interval * ping.probes)  + (socketConnectTimeout * ping.interval)]

Note the following about the formula:

  • The response time to a socket-connection attempt is less than or equal to (socketConnectTimeout * ping.interval). For calculating the worst-case scenario (absolute maximum time), the equality is used. In most real-world situations the socket-connect response time is likely to be close to 0 and the formula can be simplified to the following:

    Max Time = (ping.idletime) + [socketConnectCount * (ping.interval * ping.probes)]
    
  • ping.idletime, the trigger for the full HealthChecker process, is counted once since it is in effect only once each time the process is triggered.

  • socketConnectCount is a multiplier that is in incremented as long as a positive response is received for each socket connection attempt.
  • The formula yields an ideal value, since slight variations in actual times can occur.
  • To prevent clients from disconnecting too quickly in a situation where an active server is temporarily disconnected from both the backup server and those clients, ensure that the Max Time for L1->L2 is approximately 8-12 seconds longer than for L2->L2. If the values are too close together, then in certain situations the active server may kill the backup and refuse to allow clients to reconnect.

Configuration Examples

The configuration examples in this section show settings for L1 -> L2 HealthChecker. However, they apply in the similarly to L2 -> L2 and L2 -> L1, which means that the server is using HealthChecker on the client.

Aggressive

The following settings create an aggressive HealthChecker with low tolerance for short network outages or long GC cycles:

<property name="l1.healthcheck.l2.ping.enabled" value="true" />
<property name="l1.healthcheck.l2.ping.idletime" value="2000" />
<property name="l1.healthcheck.l2.ping.interval" value="1000" />
<property name="l1.healthcheck.l2.ping.probes" value="3" />
<property name="l1.healthcheck.l2.socketConnect" value="true" />
<property name="l1.healthcheck.l2.socketConnectTimeout" value="2" />
<property name="l1.healthcheck.l2.socketConnectCount" value="5" />

According to the HealthChecker "Max Time" formula, the maximum time before a remote node is considered to be lost is computed in the following way:

2000 + 5 [( 3 * 1000 ) + ( 2 * 1000)] = 27000

In this case, after the initial idletime of 2 seconds, the remote failed to respond to ping probes but responded to every socket connection attempt, indicating that the node is reachable but not functional (within the allowed time frame) or in a long GC cycle. This aggressive HealthChecker configuration declares a node dead in no more than 27 seconds.

If at some point the node had been completely unreachable (a socket connection attempt failed), HealthChecker would have declared it dead sooner. Where, for example, the problem is a disconnected network cable, the "Max Time" is likely to be even shorter:

2000 + 1[3 * 1000) + ( 2 * 1000 ) = 7000

In this case, HealthChecker declares a node dead in no more than 7 seconds.

Tolerant

The following settings create a HealthChecker with a higher tolerance for interruptions in network communications and long GC cycles:

<property name="l1.healthcheck.l2.ping.enabled" value="true" />
<property name="l1.healthcheck.l2.ping.idletime" value="5000" />
<property name="l1.healthcheck.l2.ping.interval" value="1000" />
<property name="l1.healthcheck.l2.ping.probes" value="3" />
<property name="l1.healthcheck.l2.socketConnect" value="true" />
<property name="l1.healthcheck.l2.socketConnectTimeout" value="5" />
<property name="l1.healthcheck.l2.socketConnectCount" value="10" />

According to the HealthChecker "Max Time" formula, the maximum time before a remote node is considered to be lost is computed in the following way:

5000 + 10 [( 3 x 1000 ) + ( 5 x 1000)] = 85000

In this case, after the initial idletime of 5 seconds, the remote failed to respond to ping probes but responded to every socket connection attempt, indicating that the node is reachable but not functional (within the allowed time frame) or excessively long GC cycle. This tolerant HealthChecker configuration declares a node dead in no more than 85 seconds.

If at some point the node had been completely unreachable (a socket connection attempt failed), HealthChecker would have declared it dead sooner. Where, for example, the problem is a disconnected network cable, the "Max Time" is likely to be even shorter:

5000 + 1[3 * 1000) + ( 5 * 1000 )] = 13000

In this case, HealthChecker declares a node dead in no more than 13 seconds.

Tuning HealthChecker to Allow for GC or Network Interruptions

GC cycles do not affect a node's ability to respond to socket-connection requests, while network interruptions do. This difference can be used to tune HealthChecker to work more efficiently in a cluster where one or the other of these issues is more likely to occur:

  • To favor detection of network interruptions, adjust the socketConnectCount down (since socket connections will fail). This will allow HealthChecker to disconnect a client sooner due to network issues.

  • To favor detection of GC pauses, adjust the socketConnectCount up (since socket connections will succeed). This will allow clients to remain in the cluster longer when no network disconnection has occurred.

The ping interval increases the time before socket-connection attempts kick in to verify health of a remote node. The ping interval can be adjusted up or down to add more or less tolerance in either of the situations listed above.

Automatic Server Instance Reconnect

An automatic reconnect mechanism can prevent short network disruptions from forcing a restart for any Terracotta server instances in a server array with hot standbys. If not disabled, this mechanism is by default in effect in clusters set to networked-based HA mode.

NOTE: Increased Time-to-Failover
This feature increases time-to-failover by the timeout value set for the automatic reconnect mechanism.

This event-based reconnection mechanism works independently and exclusively of HealthChecker. If HealthChecker has already been triggered, this mechanism cannot be triggered for the same node. If this mechanism is triggered first by an internal Terracotta event, HealthChecker is prevented from being triggered for the same node. The events that can trigger this mechanism are not exposed by API but are logged.

Configure the following properties for the reconnect mechanism:

  • l2.nha.tcgroupcomm.reconnect.enabled – (DEFAULT: false) When set to "true" enables a server instance to attempt reconnection with its peer server instance after a disconnection is detected. Most use cases should benefit from enabling this setting.
  • l2.nha.tcgroupcomm.reconnect.timeout – Enabled if l2.nha.tcgroupcomm.reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. Default: 2000. This parameter can be tuned to handle longer network disruptions.

Automatic Client Reconnect

Clients disconnected from a Terracotta cluster normally require a restart to rejoin the cluster. An automatic reconnect mechanism can prevent short network disruptions from forcing a restart for Terracotta clients disconnected from a Terracotta cluster.

NOTE: Performance Impact of Using Automatic Client Reconnect
With this feature, clients waiting to reconnect continue to hold locks. Some application threads may block while waiting to for the client to reconnect.

This event-based reconnection mechanism works independently and exclusively of HealthChecker. If HealthChecker has already been triggered, this mechanism cannot be triggered for the same node. If this mechanism is triggered first by an internal Terracotta event, HealthChecker is prevented from being triggered for the same node. The events that can trigger this mechanism are not exposed by API but are logged.

Configure the following properties for the reconnect mechanism:

  • l2.l1reconnect.enabled – (DEFAULT: false) When set to "true" enables a client to reconnect to a cluster after a disconnection is detected. This property controls a server instance's reaction to such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. If a mismatch exists between the client setting and a server instance's setting, and the client attempts to reconnect to the cluster, the client emits a mismatch error and exits. Most use cases should benefit from enabling this setting.
  • l2.l1reconnect.timeout.millis – Enabled if l2.l1reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. This property controls a server instance's timeout during such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. Default: 2000. This parameter can be tuned to handle longer network disruptions.

Special Client Connection Properties

Client connections can also be tuned for the following special cases:

  • Client failover after server failure
  • First-time client connection

The connection properties associated with these cases are already optimized for most typical environments. If you attempt to tune these properties, be sure to thoroughly test the new settings.

Client Failover After Server Failure

When an active Terracotta server instance fails, and a "hot" standby Terracotta server is available, the formerly "passive" server becomes active. Terracotta clients connected to the previous active server automatically switch to the new active server. However, these clients have a limited window of time to complete the failover.

TIP: Clusters with a Single Server
This reconnection window also applies in a cluster with a single Terracotta server that is restarted. However, a single-server cluster must have its <persistence> element’s <mode> subelement set to "permanent-store" for the reconnection window to take effect.

This window is configured in the Terracotta configuration file using the <client-reconnect-window> element:

<servers>
  <server>
  ...
     <dso>
     ...
        <!-- The reconnect window is configured in seconds, with a default value of 120. The default value is "built in," so the element does not have to be explicitly added unless a different value is required. -->
        <client-reconnect-window>120</client-reconnect-window>
     ...
     </dso>
  ...
  </server>
</servers>

Clients which fail to connect to the new active server must be restarted if they are to successfully rejoin the cluster.

First-Time Client Connection

When a Terracotta client is first started (or restarted), it attempts to connect to a Terracotta server instance based on the following properties:

# -1 == retry all configured servers eternally.
l1.max.connect.retries = -1
# Must the client and server be running the same version of Terracotta?
l1.connect.versionMatchCheck.enabled = true
# Time (in milliseconds) before a socket connection attempt is timed out.
l1.socket.connect.timeout=10000
# Time (in milliseconds; minimum 10) between attempts to connect to a server.
l1.socket.reconnect.waitInterval=1000

A client with l1.max.connect.retries set to a positive integer is given a limited number of attempts (equal to that integer) to connect. If the client fails to connect after the configured number of attempts, it exits.

Effective Client-Server Reconnection Settings: An Example

To prevent unwanted disconnections, it is important to under potentially complex interaction between HA settings and the environment in which your cluster runs. Settings that are not appropriate for a particular environment can lead to unwanted disconnections under certain circumstances.

In general, it is advisable to maintain an L1-L2 HealthChecker timeout that falls between the L2-L2 HealthChecker timeout as modified in the following ineqaulity:

L2-L2 HealthCheck + Election Time
        <  L1-L2 HealthCheck 
                  <  L2-L2 HealthCheck + Election Time + Client Reconnect Window

This allows a cluster's L1s to avoid disconnecting before a client reconnection window is opened (a backup L2 takes over), or to not disconnect if that window is never opened (the original active L2 is still functional). The Election Time and Client Reconnect Window settings, which are found in the Terracotta configuration file, are respectively 5 seconds and 120 seconds by default.

For example, in a cluster where the L2-L2 HealthChecker triggers at 55 seconds, a backup L2 can take over the cluster after 180 seconds (55 + 5 + 120). If the L1-L2 HealthChecker triggers after a time that is greater than 180 seconds, clients may not attempt to reconnect until the reconnect window is closed and it's too late.

If the L1-L2 HealthChecker triggers after a time that is less than 60 seconds (L2-L2 HealthChecker + Election Time), then the clients may disconnect from the active L2 before its failure is determined. Should the active L2 win the election, the disconnected L1s would then be lost.

Beginning with Terracotta 3.6.1, a check is performed at server startup to ensure that L1-L2 HealthChecker settings are within the effective range. If not, a warning is printer with a prescription.