Terracotta Logo (http://www.terracotta.org)

Terracotta Documentation Home

Terracotta 3.2.1 Documentation

Table of Contents •  Back  •  Forward


Register

6.2 Configuring Terracotta For High Availability

High Availability (HA) is an implementation designed to maintain uptime and access to services even during component overloads and failures. Terracotta clusters offer simple and scalable HA implementations based on the Terracotta server array (see Terracotta Server Arrays for more information).

The main features of a Terracotta HA architecture include:

6.2.1 Basic High-Availability Configuration

A basic high-availability configuration has the following components:

<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
  <servers>
...
    <ha>
      <mode>networked-active-passive</mode>
      <networked-active-passive>
        <election-time>5</election-time>
      </networked-active-passive>
    </ha>
  </servers>
  ...
</tc:tc-config>

NOTE: Servers Should Not Share Data Directories

When using networked-active-passive mode, Terracotta server instances must not share data directories. Each server's <data> element should point to a different and preferably local data directory.

For more information on Terracotta configuration files, see:

6.2.2 High-Availability Features

The following high-availability features can be used to extend the reliability of a Terracotta cluster. These features are controlled using properties set with the <tc-properties> section at the beginning of the Terracotta configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- All content copyright Terracotta, Inc., unless otherwise indicated. All rights reserved. -->
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
 
  <tc-properties>
    <property name="some.property.name" value="true"/>
    <property name="some.other.property.name" value="true"/>
    <property name="still.another.property.name" value="1024"/>
  </tc-properties>
 
<!-- The rest of the Terracotta configuration goes here. -->
 
</tc:tc-config>

See the section Overriding tc.properties in Configuration Guide and Reference for more information.

6.2.2a HealthChecker

HealthChecker is a connection monitor similar to TCP keep-alive. HealthChecker functions between Terracotta server instances (in High Availability environments), and between Terracotta sever instances and clients. Using HealthChecker, Terracotta nodes can determine if peer nodes are reachable, up, or in a GC operation. If a peer node is unreachable or down, a Terracotta node using HealthChecker can take corrective action. HealthChecker is on by default.

You configure HealthChecker using certain Terracotta properties, which are grouped into three different categories:

  • Terracotta server instance -> Terracotta client
  • Terracotta Server -> Terracotta Server (HA setup only)
  • Terracotta Client -> Terracotta Server

Property category is indicated by the prefix:

  • l2.healthcheck.l1 indicates L2 -> L1
  • l2.healthcheck.l2 indicates L2 -> L2
  • l1.healthcheck.l2 indicates L1 -> L2

For example, the l2.healthcheck.l2.ping.enabled property applies to L2 -> L2.

The following HealthChecker properties can be set in the <tc-properties> section of the Terracotta configuration file:

Property
Definition
l2.healthcheck.l1.ping.enabled
l2.healthcheck.l2.ping.enabled
l1.healthcheck.l2.ping.enabled
Enables (True) or disables (False) ping probes (tests). Ping probes are high-level attempts to gauge the ability of a remote node to respond to requests and is useful for determining if temporary inactivity or problems are responsible for the node’s silence. Ping probes may fail due to long GC cycles on the remote node.
l2.healthcheck.l1.ping.idletime
l2.healthcheck.l2.ping.idletime
l1.healthcheck.l2.ping.idletime
The maximum time (in milliseconds) that a node can be silent (have no network traffic) before HealthChecker begins a ping probe to determine if the node is alive.
l2.healthcheck.l1.ping.interval
l2.healthcheck.l2.ping.interval
l1.healthcheck.l2.ping.interval
If no response is received to a ping probe, the time (in milliseconds) that HealthChecker waits between retries.
l2.healthcheck.l1.ping.probes
l2.healthcheck.l2.ping.probes
l1.healthcheck.l2.ping.probes
If no response is received to a ping probe, the maximum number (integer) of retries HealthChecker can attempt.
l2.healthcheck.l1.socketConnect
l2.healthcheck.l2.socketConnect
l1.healthcheck.l2.socketConnect
Enables (True) or disables (False) socket-connection tests. This is a low-level connection that determines if the remote node is reachable and can access the network. Socket connections are not affected by GC cycles.
l2.healthcheck.l1.socketConnectTimeout
l2.healthcheck.l2.socketConnectTimeout
l1.healthcheck.l2.socketConnectTimeout
A multiplier (integer) to determine the maximum amount of time that a remote node has to respond before HealthChecker concludes that the node is dead (regardless of previous successful socket connections). The time is determined by multiplying the value in ping.interval by this value.
l2.healthcheck.l1.socketConnectCount
l2.healthcheck.l2.socketConnectCount
l1.healthcheck.l2.socketConnectCount
The maximum number (integer) of successful socket connections that can be made without a successful ping probe. If this limit is exceeded, HealthChecker concludes that the target node is dead.
l1.healthcheck.l2.bindAddress
Binds the client to the configured IP address. This is useful where a host has more than one IP address available for a client to use. The default value of "0.0.0.0" allows the system to assign an IP address.
l1.healthcheck.l2.bindPort
Set the client’s callback port. Terracotta configuration does not assign clients a port for listening to cluster communications such as that required by HealthChecker. The default value of "0" allows the system to assign a port. A value of "-1" disables a client’s callback port.

The following diagram illustrates how HealthChecker functions.

Terracotta HealthChecker flow diagram.

Calculating HealthChecker Maximum

The following formula can help you compute the maximum time it will take HealthChecker to discover failed or disconnected remote nodes:

Max Time = (ping.idletime) + socketConnectCount * [(ping.interval * ping.probes) + (socketConnectTimeout * ping.interval)]

Note the following about the formula:

  • The response time to a socket-connection attempt is less than or equal to (socketConnectTimeout * ping.interval). For calculating the worst-case scenario (absolute maximum time), the equality is used. In most real-world situations the socket-connect response time is likely to be close to 0 and the formula can be simplified to the following:
  • Max Time = (ping.idletime) + [socketConnectCount * (ping.interval * ping.probes)]

  • ping.idletime, the trigger for the full HealthChecker process, is counted once since it is in effect only once each time the process is triggered.
  • socketConnectCount is a multiplier that is in incremented as long as a positive response is received for each socket connection attempt.
  • The formula yields an ideal value, since slight variations in actual times can occur.
Configuration Examples

The configuration examples in this section show settings for L1 -> L2 HealthChecker. However, they apply in the similarly to L2 -> L2 and L2 -> L1, which means that the server is using HealthChecker on the client.

Aggressive

The following settings create an aggressive HealthChecker with low tolerance for short network outages or long GC cycles:

<property name="l1.healthcheck.l2.ping.enabled" value="true" />
<property name="l1.healthcheck.l2.ping.idletime" value="2000" />
<property name="l1.healthcheck.l2.ping.interval" value="1000" />
<property name="l1.healthcheck.l2.ping.probes" value="3" />
<property name="l1.healthcheck.l2.socketConnect" value="true" />
<property name="l1.healthcheck.l2.socketConnectTimeout" value="2" />
<property name="l1.healthcheck.l2.socketConnectCount" value="5" />

According to the HealthChecker "Max Time" formula, the maximum time before a remote node is considered to be lost is computed in the following way:

2000 + 5 [( 3 * 1000 ) + ( 2 * 1000)] = 27000

In this case, after the initial idletime of 2 seconds, the remote failed to respond to ping probes but responded to every socket connection attempt, indicating that the node is reachable but not functional (within the allowed time frame) or in a long GC cycle. This aggressive HealthChecker configuration declares a node dead in no more than 27 seconds.

If at some point the node had been completely unreachable (a socket connection attempt failed), HealthChecker would have declared it dead sooner. Where, for example, the problem is a disconnected network cable, the "Max Time" is likely to be even shorter:

2000 + 1[3 * 1000) + ( 2 * 1000 ) = 7000

In this case, HealthChecker declares a node dead in no more than 7 seconds.

Tolerant

The following settings create a HealthChecker with a higher tolerance for interruptions in network communications and long GC cycles:

<property name="l1.healthcheck.l2.ping.enabled" value="true" />
<property name="l1.healthcheck.l2.ping.idletime" value="5000" />
<property name="l1.healthcheck.l2.ping.interval" value="1000" />
<property name="l1.healthcheck.l2.ping.probes" value="3" />
<property name="l1.healthcheck.l2.socketConnect" value="true" />
<property name="l1.healthcheck.l2.socketConnectTimeout" value="5" />
<property name="l1.healthcheck.l2.socketConnectCount" value="10" />

According to the HealthChecker "Max Time" formula, the maximum time before a remote node is considered to be lost is computed in the following way:

5000 + 10 [( 3 x 1000 ) + ( 5 x 1000)] = 85000

In this case, after the initial idletime of 5 seconds, the remote failed to respond to ping probes but responded to every socket connection attempt, indicating that the node is reachable but not functional (within the allowed time frame) or excessively long GC cycle. This tolerant HealthChecker configuration declares a node dead in no more than 85 seconds.

If at some point the node had been completely unreachable (a socket connection attempt failed), HealthChecker would have declared it dead sooner. Where, for example, the problem is a disconnected network cable, the "Max Time" is likely to be even shorter:

5000 + 1[3 * 1000) + ( 5 * 1000 )] = 13000

In this case, HealthChecker declares a node dead in no more than 13 seconds.

6.2.2b Automatic Server Instance Reconnect

You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for any Terracotta server instances in a server array with hot standbys.

NOTE: Increased Time-to-Failover

If you enable this feature, time-to-failover increases by the timeout value set for the automatic reconnect mechanism.

This event-based reconnection mechanism works independently and exclusively of HealthChecker. If HealthChecker has already been triggered, this mechanism cannot be triggered for the same node. If this mechanism is triggered first by an internal Terracotta event, HealthChecker is prevented from being triggered for the same node. The events that can trigger this mechanism are not exposed by API but are logged.

Configure the following properties for the reconnect mechanism:

  • l2.nha.tcgroupcomm.reconnect.enabled – Enables a server instance to attempt reconnection with its peer server instance after a disconnection is detected. Default: false.
  • l2.nha.tcgroupcomm.reconnect.timeout – Enabled if l2.nha.tcgroupcomm.reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. Default: 5000.

6.2.2c Automatic Client Reconnect

Clients disconnected from a Terracotta cluster normally require a restart to rejoin the cluster. You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for Terracotta clients disconnected from a Terracotta cluster.

NOTE: Performance Impact of Using Automatic Client Reconnect

If you enable this feature, clients waiting to reconnect continue to hold locks. Some application threads may block while waiting to for the client to reconnect.

This event-based reconnection mechanism works independently and exclusively of HealthChecker. If HealthChecker has already been triggered, this mechanism cannot be triggered for the same node. If this mechanism is triggered first by an internal Terracotta event, HealthChecker is prevented from being triggered for the same node. The events that can trigger this mechanism are not exposed by API but are logged.

Configure the following properties for the reconnect mechanism:

  • l2.l1reconnect.enabled – Enables a client to rejoin a cluster after a disconnection is detected. This property controls a server instance's reaction to such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. If a mismatch exists between the client setting and a server instance's setting, and the client attempts to rejoin the cluster, the client emits a mismatch error and exits. Default: false.
  • l2.l1reconnect.timeout.millis – Enabled if l2.l1reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. This property controls a server instance's timeout during such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. Default: 5000.

6.2.2d Special Client Connection Properties

Client connections can also be tuned for the following special cases:

  • Client failover after server failure
  • First-time client connection

The connection properties associated with these cases are already optimized for most typical environments. If you attempt to tune these properties, be sure to thoroughly test the new settings.

Client Failover After Server Failure

When an active Terracotta server instance fails, and a "hot" standby Terracotta server is available, the formerly "passive" server becomes active. Terracotta clients connected to the previous active server automatically switch to the new active server. However, these clients have a limited window of time to complete the failover.

TIP: Clusters with a Single Server

This reconnection window also applies in a cluster with a single Terracotta server that is restarted. However, a single-server cluster must have its <persistence> element’s <mode> subelement set to "permanent-store" for the reconnection window to take effect.

This window is configured in the Terracotta configuration file using the <client-reconnect-window> element:

<servers>
   <server>
   ...
      <dso>
      ...
         <!-- The reconnect window is configured in seconds, with a default value of 120. The default value is "built in," so the element does not have to be explicitly added unless a different value is required. -->
         <client-reconnect-window>120</client-reconnect-window>
      ...
      </dso>
   ...
   </server>
</servers>

Clients which fail to connect to the new active server must be restarted if they are to successfully rejoin the cluster.

First-Time Client Connection

When a Terracotta client is first started (or restarted), it attempts to connect to a Terracotta server instance based on the following properties:

# -1 == retry all configured servers eternally.
l1.max.connect.retries = -1
# Must the client and server be running the same version of Terracotta?
l1.connect.versionMatchCheck.enabled = true
# Time (in milliseconds) before a socket connection attempt is timed out.
l1.socket.connect.timeout=10000
# Time (in milliseconds; minimum 10) between attempts to connect to a server.
l1.socket.reconnect.waitInterval=1000

A client with l1.max.connect.retries set to a positive integer is given a limited number of attempts (equal to that integer) to connect. If the client fails to connect after the configured number of attempts, it exits.


Top of 6.2 Configuring Terracotta For High Availability