High Availability (HA) is an implementation designed to maintain uptime and access to services even during component overloads and failures. Terracotta clusters offer simple and scalable HA implementations based on the Terracotta server array (see Terracotta Server Arrays for more information).
The main features of a Terracotta HA architecture include:
A basic high-availability configuration has the following components:
See Terracotta Server Arrays on how to set up a cluster with multiple Terracotta server instances.
The <ha> section in the Terracotta configuration file should indicate the mode as networked-active-passive to allow for an active server instance and one or more "hot standby" (backup) server instances. The <networked-active-passive> subsection has a configurable parameter called <election-time> whose value is given in seconds. <election-time>, which sets the duration for elections to elect an active server, is a factor in network latency and server load. The default value is 5 seconds:
<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
<servers>
...
<ha>
<mode>networked-active-passive</mode>
<networked-active-passive>
<election-time>5</election-time>
</networked-active-passive>
</ha>
</servers>
...
</tc:tc-config>
A reconnection mechanism can be enabled to restore lost connections between active and passive Terracotta server instances. See Automatic Server Instance Reconnect for more information.
A reconnection mechanism can be enabled to restore lost connections between Terracotta clients and server instances. See Automatic Client Reconnect for more information.
For more information on Terracotta configuration files, see:
The following high-availability features can be used to extend the reliability of a Terracotta cluster. These features are controlled using properties set with the <tc-properties> section at the beginning of the Terracotta configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<!-- All content copyright Terracotta, Inc., unless otherwise indicated. All rights reserved. -->
<tc:tc-config xmlns:tc="http://www.terracotta.org/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
<tc-properties>
<property name="some.property.name" value="true"/>
<property name="some.other.property.name" value="true"/>
<property name="still.another.property.name" value="1024"/>
</tc-properties>
<!-- The rest of the Terracotta configuration goes here. -->
</tc:tc-config>
See the section Overriding tc.properties in Configuration Guide and Reference for more information.
HealthChecker is a connection monitor similar to TCP keep-alive. HealthChecker functions between Terracotta server instances (in High Availability environments), and between Terracotta sever instances and clients. Using HealthChecker, Terracotta nodes can determine if peer nodes are reachable, up, or in a GC operation. If a peer node is unreachable or down, a Terracotta node using HealthChecker can take corrective action. HealthChecker is on by default.
You configure HealthChecker using certain Terracotta properties, which are grouped into three different categories:
Property category is indicated by the prefix:
For example, the
l2.healthcheck.l2.ping.enabled
property applies to L2 -> L2.
The following HealthChecker properties can be set in the <tc-properties> section of the Terracotta configuration file:
The following diagram illustrates how HealthChecker functions.

The following formula can help you compute the maximum time it will take HealthChecker to discover failed or disconnected remote nodes:
Max Time = (ping.idletime) + socketConnectCount * [(ping.interval * ping.probes) + (socketConnectTimeout * ping.interval)]
Note the following about the formula:
Max Time = (ping.idletime) + [socketConnectCount * (ping.interval * ping.probes)]
The configuration examples in this section show settings for L1 -> L2 HealthChecker. However, they apply in the similarly to L2 -> L2 and L2 -> L1, which means that the server is using HealthChecker on the client.
The following settings create an aggressive HealthChecker with low tolerance for short network outages or long GC cycles:
<property name="l1.healthcheck.l2.ping.enabled" value="true" />
<property name="l1.healthcheck.l2.ping.idletime" value="2000" />
<property name="l1.healthcheck.l2.ping.interval" value="1000" />
<property name="l1.healthcheck.l2.ping.probes" value="3" />
<property name="l1.healthcheck.l2.socketConnect" value="true" />
<property name="l1.healthcheck.l2.socketConnectTimeout" value="2" />
<property name="l1.healthcheck.l2.socketConnectCount" value="5" />
According to the HealthChecker "Max Time" formula, the maximum time before a remote node is considered to be lost is computed in the following way:
2000 + 5 [( 3 * 1000 ) + ( 2 * 1000)] = 27000
In this case, after the initial idletime of 2 seconds, the remote failed to respond to ping probes but responded to every socket connection attempt, indicating that the node is reachable but not functional (within the allowed time frame) or in a long GC cycle. This aggressive HealthChecker configuration declares a node dead in no more than 27 seconds.
If at some point the node had been completely unreachable (a socket connection attempt failed), HealthChecker would have declared it dead sooner. Where, for example, the problem is a disconnected network cable, the "Max Time" is likely to be even shorter:
2000 + 1[3 * 1000) + ( 2 * 1000 ) = 7000
In this case, HealthChecker declares a node dead in no more than 7 seconds.
The following settings create a HealthChecker with a higher tolerance for interruptions in network communications and long GC cycles:
<property name="l1.healthcheck.l2.ping.enabled" value="true" />
<property name="l1.healthcheck.l2.ping.idletime" value="5000" />
<property name="l1.healthcheck.l2.ping.interval" value="1000" />
<property name="l1.healthcheck.l2.ping.probes" value="3" />
<property name="l1.healthcheck.l2.socketConnect" value="true" />
<property name="l1.healthcheck.l2.socketConnectTimeout" value="5" />
<property name="l1.healthcheck.l2.socketConnectCount" value="10" />
According to the HealthChecker "Max Time" formula, the maximum time before a remote node is considered to be lost is computed in the following way:
5000 + 10 [( 3 x 1000 ) + ( 5 x 1000)] = 85000
In this case, after the initial idletime of 5 seconds, the remote failed to respond to ping probes but responded to every socket connection attempt, indicating that the node is reachable but not functional (within the allowed time frame) or excessively long GC cycle. This tolerant HealthChecker configuration declares a node dead in no more than 85 seconds.
If at some point the node had been completely unreachable (a socket connection attempt failed), HealthChecker would have declared it dead sooner. Where, for example, the problem is a disconnected network cable, the "Max Time" is likely to be even shorter:
5000 + 1[3 * 1000) + ( 5 * 1000 )] = 13000
In this case, HealthChecker declares a node dead in no more than 13 seconds.
You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for any Terracotta server instances in a server array with hot standbys.
This event-based reconnection mechanism works independently and exclusively of HealthChecker. If HealthChecker has already been triggered, this mechanism cannot be triggered for the same node. If this mechanism is triggered first by an internal Terracotta event, HealthChecker is prevented from being triggered for the same node. The events that can trigger this mechanism are not exposed by API but are logged.
Configure the following properties for the reconnect mechanism:
l2.nha.tcgroupcomm.reconnect.enabled
– Enables a server instance to attempt reconnection with its peer server instance after a disconnection is detected. Default: false.
l2.nha.tcgroupcomm.reconnect.timeout
– Enabled if l2.nha.tcgroupcomm.reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. Default: 5000.Clients disconnected from a Terracotta cluster normally require a restart to rejoin the cluster. You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for Terracotta clients disconnected from a Terracotta cluster.
This event-based reconnection mechanism works independently and exclusively of HealthChecker. If HealthChecker has already been triggered, this mechanism cannot be triggered for the same node. If this mechanism is triggered first by an internal Terracotta event, HealthChecker is prevented from being triggered for the same node. The events that can trigger this mechanism are not exposed by API but are logged.
Configure the following properties for the reconnect mechanism:
l2.l1reconnect.enabled
– Enables a client to rejoin a cluster after a disconnection is detected. This property controls a server instance's reaction to such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. If a mismatch exists between the client setting and a server instance's setting, and the client attempts to rejoin the cluster, the client emits a mismatch error and exits. Default: false.
l2.l1reconnect.timeout.millis
– Enabled if
l2.l1reconnect.enabled
is set to true. Specifies the timeout (in milliseconds) for reconnection. This property controls a server instance's timeout during such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. Default: 5000.Client connections can also be tuned for the following special cases:
The connection properties associated with these cases are already optimized for most typical environments. If you attempt to tune these properties, be sure to thoroughly test the new settings.
When an active Terracotta server instance fails, and a "hot" standby Terracotta server is available, the formerly "passive" server becomes active. Terracotta clients connected to the previous active server automatically switch to the new active server. However, these clients have a limited window of time to complete the failover.
This window is configured in the Terracotta configuration file using the <client-reconnect-window> element:
<servers>
<server>
...
<dso>
...
<!-- The reconnect window is configured in seconds, with a default value of 120. The default value is "built in," so the element does not have to be explicitly added unless a different value is required. -->
<client-reconnect-window>120</client-reconnect-window>
...
</dso>
...
</server>
</servers>
Clients which fail to connect to the new active server must be restarted if they are to successfully rejoin the cluster.
When a Terracotta client is first started (or restarted), it attempts to connect to a Terracotta server instance based on the following properties:
# -1 == retry all configured servers eternally.
l1.max.connect.retries = -1
# Must the client and server be running the same version of Terracotta?
l1.connect.versionMatchCheck.enabled = true
# Time (in milliseconds) before a socket connection attempt is timed out.
l1.socket.connect.timeout=10000
# Time (in milliseconds; minimum 10) between attempts to connect to a server.
l1.socket.reconnect.waitInterval=1000
A client with
l1.max.connect.retries
set to a positive integer is given a limited number of attempts (equal to that integer) to connect. If the client fails to connect after the configured number of attempts, it exits.