Configuring Terracotta For High Availability

THIS IS ARCHIVE DOCUMENTATION FOR TERRACOTTA v. 3.0.

For the current release, see the current DSO documentation.

Release: 3.0.1
Publish Date: May, 2009

Configuring and Testing Terracotta For High Availability

Introduction

High Availability (HA) is an implementation designed to maintain uptime and access to services even during component overloads and failures. Terracotta clusters offer simple and scalable HA implementations based on the Terracotta server array (see Terracotta Server Arrays for more information).

The main features of a Terracotta HA architecture include:

  • Instant failover using a hot standby or multiple active servers – provides continuous uptime and services
  • Configurable automatic internode monitoring – Terracotta HealthChecker
  • Automatic permanent storage of all current shared (in-memory) data – available to all server instances (no loss of application state)
  • Automatic reconnection of temporarily disconnected server instances and clients – restores hot standbys without operator intervention, allows "lost" clients to reconnect

This document may refer to a Terracotta server instance as L2, and a Terracotta client (the node running your application) as L1. These are the shorthand references used in Terracotta configuration files.

Basic High-Availability Configuration

A basic high-availability configuration has the following components:

  • Two or More Terracotta Server Instances
    See Terracotta Server Arrays on how to set up a cluster with multiple Terracotta server instances.
  • Active-Passive Mode
    The <ha> section in the Terracotta configuration file should set the mode to networked-active-passive to allow for an active server instance and one or more "hot standby" (backup) server instances. The <networked-active-passive> subsection has a configurable parameter, <election-time>, whose value is given in seconds. <election-time> sets how long an election for the active server lasts; choose a value that accounts for network latency and server load. The default value is 5 seconds:
<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
                         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                         xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-4.xsd">
  <servers>
...
    <ha>
      <mode>networked-active-passive</mode>
      <networked-active-passive>
        <election-time>5</election-time>
      </networked-active-passive>
    </ha>
  </servers>
  ...
</tc:tc-config>
  • Server-Server Reconnection
    A reconnection mechanism can be enabled to restore lost connections between active and passive Terracotta server instances. See Automatic Server Instance Reconnect for more information.
  • Server-Client Reconnection
    A reconnection mechanism can be enabled to restore lost connections between Terracotta clients and server instances. See Automatic Client Reconnect for more information.

For more information on Terracotta configuration files, see the Configuration Guide and Reference.

High-Availability Features

The following high-availability features can be used to extend the reliability of a Terracotta cluster. These features are controlled by properties set in the <tc-properties> section of the Terracotta configuration file. See the Configuration Guide and Reference for more information.

Automatic Server Instance Reconnect

You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for any Terracotta server instances in a server array with hot standbys.

If you enable this feature, time-to-failover increases by the timeout value set for the automatic reconnect mechanism.

Configure the following properties for the reconnect mechanism:

  • l2.nha.tcgroupcomm.reconnect.enabled – Enables a server instance to attempt reconnection with its peer server instance after a disconnection is detected. Default: false.
  • l2.nha.tcgroupcomm.reconnect.timeout – Enabled if l2.nha.tcgroupcomm.reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. Default: 5000.
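
As an illustration, the two properties above might be set in the <tc-properties> section of the Terracotta configuration file as sketched below. The <property name="..." value="..."/> element form is an assumption based on the general tc-properties convention, and the values are examples only; confirm the exact syntax and placement against the Configuration Guide and Reference.

```xml
<!-- Sketch only: element syntax and placement are assumptions; verify before use. -->
<tc-properties>
  <!-- Let a server instance try to reconnect to its peer after a disconnection. -->
  <property name="l2.nha.tcgroupcomm.reconnect.enabled" value="true"/>
  <!-- Reconnection timeout in milliseconds (5000 is the documented default). -->
  <property name="l2.nha.tcgroupcomm.reconnect.timeout" value="5000"/>
</tc-properties>
```

With a timeout of 5000, time-to-failover increases by 5 seconds, because a failover cannot begin until the reconnect window has expired.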

Automatic Client Reconnect

Clients disconnected from a Terracotta cluster normally require a restart to rejoin the cluster. You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for Terracotta clients disconnected from a Terracotta cluster.

Configure the following properties for the reconnect mechanism:

  • l2.l1reconnect.enabled – Enables a client to rejoin a cluster after a disconnection is detected. This property controls a server instance's reaction to such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. If a mismatch exists between the client setting and a server instance's setting, and the client attempts to rejoin the cluster, the client emits a mismatch error and exits. Default: false.
  • l2.l1reconnect.timeout.millis – Enabled if l2.l1reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. This property controls a server instance's timeout during such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. Default: 5000.
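
A minimal sketch of enabling client reconnection on the server side follows. The <property name="..." value="..."/> element form is assumed rather than taken from this page, and the timeout value is illustrative (it matches the documented default). Because the server instance passes these settings to clients, the fragment belongs in the configuration that the server instances read.

```xml
<!-- Sketch only: element syntax is an assumption; verify before use. -->
<tc-properties>
  <!-- Allow disconnected clients (L1) to rejoin the cluster without a restart. -->
  <property name="l2.l1reconnect.enabled" value="true"/>
  <!-- How long the server instance waits for a client to reconnect, in milliseconds. -->
  <property name="l2.l1reconnect.timeout.millis" value="5000"/>
</tc-properties>
```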

HealthChecker

HealthChecker is a connection monitor similar to TCP keep-alive. HealthChecker functions between Terracotta server instances (in High Availability environments), and between Terracotta server instances and clients. Using HealthChecker, Terracotta nodes can determine if peer nodes are reachable, up, or in a GC operation. If a peer node is unreachable or down, a Terracotta node using HealthChecker can take corrective action.

You configure HealthChecker using certain Terracotta properties, which are grouped into three different categories:

  • Terracotta server instance -> Terracotta client
  • Terracotta server instance -> Terracotta server instance (HA setup only)
  • Terracotta client -> Terracotta server instance

Property category is indicated by the prefix:

  • l2.healthcheck.l1 indicates L2 -> L1
  • l2.healthcheck.l2 indicates L2 -> L2
  • l1.healthcheck.l2 indicates L1 -> L2

For example, the l2.healthcheck.l2.ping.enabled property applies to L2 -> L2.

The following properties are supported:

  • ping.enabled – Enables (True) or disables (False) ping probes (tests).
  • ping.idletime – The maximum time (in milliseconds) that a node can be silent (have no network traffic) before HealthChecker begins a ping probe to determine if the node is alive.
  • ping.interval – If no response is received to a ping probe, the time (in milliseconds) that HealthChecker waits between retries.
  • ping.probes – If no response is received to a ping probe, the maximum number of retries HealthChecker can attempt.
  • socketConnect – Enables (True) or disables (False) socket-connection tests.
  • socketConnectTimeout – The maximum number of ping.idletime intervals before HealthChecker concludes that the node is dead regardless of successful socket connections.
  • socketConnectCount – The maximum number of socket connections that can be made without a successful ping probe, after which HealthChecker concludes that the target node is dead.
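
To show how a category prefix combines with a property name, the sketch below configures HealthChecker for the L2 -> L2 category. The <property name="..." value="..."/> element form is an assumption, and every value is illustrative only; none of them is stated as a default on this page.

```xml
<!-- Sketch only: element syntax is assumed and all values are illustrative. -->
<tc-properties>
  <!-- Full name = category prefix (l2.healthcheck.l2) + property (ping.enabled). -->
  <property name="l2.healthcheck.l2.ping.enabled" value="true"/>
  <!-- Begin probing after 5000 ms without traffic from the peer. -->
  <property name="l2.healthcheck.l2.ping.idletime" value="5000"/>
  <!-- Wait 1000 ms between retries, for at most 3 retries. -->
  <property name="l2.healthcheck.l2.ping.interval" value="1000"/>
  <property name="l2.healthcheck.l2.ping.probes" value="3"/>
  <!-- Also run socket-connection tests against the peer. -->
  <property name="l2.healthcheck.l2.socketConnect" value="true"/>
</tc-properties>
```

The same property names apply to the other two categories; only the prefix changes, for example l2.healthcheck.l1.ping.enabled for L2 -> L1.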
