Release: 3.0.1
Publish Date: May, 2009


Configuring and Testing Terracotta For High Availability

Introduction

High Availability (HA) is an implementation designed to maintain uptime and access to services even during component overloads and failures. Terracotta clusters offer simple and scalable HA implementations based on the Terracotta server array (see Terracotta Server Arrays for more information).

The main features of a Terracotta HA architecture include:

  • Instant failover using a hot standby or multiple active servers – provides continuous uptime and services
  • Configurable automatic internode monitoring – Terracotta HealthChecker
  • Automatic permanent storage of all current shared (in-memory) data – available to all server instances (no loss of application state)
  • Automatic reconnection of temporarily disconnected server instances and clients – restores hot standbys without operator intervention, allows "lost" clients to reconnect

This document may refer to a Terracotta server instance as L2, and a Terracotta client (the node running your application) as L1. These are the shorthand references used in Terracotta configuration files.

Basic High-Availability Configuration

A basic high-availability configuration has the following components:

  • Two or More Terracotta Server Instances
    See Terracotta Server Arrays on how to set up a cluster with multiple Terracotta server instances.
  • Active-Passive Mode
    The <ha> section in the Terracotta configuration file should set the mode to networked-active-passive to allow for an active server instance and one or more "hot standby" (backup) server instances. The <networked-active-passive> subsection has a configurable parameter called <election-time>, whose value is given in seconds. <election-time> sets how long an election for the active server lasts; choose a value that accounts for network latency and server load. The default value is 5 seconds:
    <?xml version="1.0" encoding="UTF-8" ?>
    <tc:tc-config xmlns:tc="http://www.terracotta.org/config"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-4.xsd">
      <servers>
        ...
        <ha>
          <mode>networked-active-passive</mode>
          <networked-active-passive>
            <election-time>5</election-time>
          </networked-active-passive>
        </ha>
      </servers>
      ...
    </tc:tc-config>
  • Server-Server Reconnection
    A reconnection mechanism can be enabled to restore lost connections between active and passive Terracotta server instances. See Automatic Server Instance Reconnect for more information.
  • Server-Client Reconnection
    A reconnection mechanism can be enabled to restore lost connections between Terracotta clients and server instances. See Automatic Client Reconnect for more information.

For more information on Terracotta configuration files, see the Configuration Guide and Reference.

High-Availability Features

The following high-availability features can be used to extend the reliability of a Terracotta cluster. These features are controlled using properties set in the <tc-properties> section of the Terracotta configuration file. See the Configuration Guide and Reference for more information.

Automatic Server Instance Reconnect

You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for any Terracotta server instances in a server array with hot standbys.

If you enable this feature, time-to-failover increases by the timeout value set for the automatic reconnect mechanism.

Configure the following properties for the reconnect mechanism:

  • l2.nha.tcgroupcomm.reconnect.enabled – Enables a server instance to attempt reconnection with its peer server instance after a disconnection is detected. Default: false.
  • l2.nha.tcgroupcomm.reconnect.timeout – Enabled if l2.nha.tcgroupcomm.reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. Default: 5000.
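
As a minimal sketch, the two properties above might be set as follows in the <tc-properties> section. The <property name="..." value="..."/> element form is an assumption not shown in this document, so confirm it against the Configuration Guide and Reference. The timeout shown is simply the default.

```xml
<!-- Sketch only: the <property> element form is assumed, not taken from this document. -->
<tc-properties>
  <!-- Let a server instance try to reconnect to its peer after a disconnection. -->
  <property name="l2.nha.tcgroupcomm.reconnect.enabled" value="true"/>
  <!-- Reconnection timeout in milliseconds (default value shown). -->
  <property name="l2.nha.tcgroupcomm.reconnect.timeout" value="5000"/>
</tc-properties>
```

With these values, time-to-failover increases by the 5000 ms reconnect timeout.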

Automatic Client Reconnect

Clients disconnected from a Terracotta cluster normally require a restart to rejoin the cluster. You can configure an automatic reconnect mechanism to prevent short network disruptions from forcing a restart for Terracotta clients disconnected from a Terracotta cluster.

Configure the following properties for the reconnect mechanism:

  • l2.l1reconnect.enabled – Enables a client to rejoin a cluster after a disconnection is detected. This property controls a server instance's reaction to such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. If a mismatch exists between the client setting and a server instance's setting, and the client attempts to rejoin the cluster, the client emits a mismatch error and exits. Default: false.
  • l2.l1reconnect.timeout.millis – Enabled if l2.l1reconnect.enabled is set to true. Specifies the timeout (in milliseconds) for reconnection. This property controls a server instance's timeout during such an attempt. It is set on the server instance and is passed to clients by the server instance. A client cannot override the server instance's setting. Default: 5000.
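
As a minimal sketch, client reconnection might be enabled on the server instance as follows. The <property name="..." value="..."/> element form is an assumption not shown in this document, so confirm it against the Configuration Guide and Reference. The timeout shown is the default.

```xml
<!-- Sketch only: the <property> element form is assumed, not taken from this document. -->
<tc-properties>
  <!-- Set on the server instance, which passes the setting to clients. -->
  <property name="l2.l1reconnect.enabled" value="true"/>
  <!-- How long (in milliseconds) the server instance waits for a client to reconnect. -->
  <property name="l2.l1reconnect.timeout.millis" value="5000"/>
</tc-properties>
```

Because a client cannot override these values, they only need to be set in the configuration used by the server instances.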

HealthChecker

HealthChecker is a connection monitor similar to TCP keep-alive. HealthChecker functions between Terracotta server instances (in High Availability environments), and between Terracotta server instances and clients. Using HealthChecker, Terracotta nodes can determine if peer nodes are reachable, up, or in a GC operation. If a peer node is unreachable or down, a Terracotta node using HealthChecker can take corrective action.

You configure HealthChecker using certain Terracotta properties, which are grouped into three different categories:

  • Terracotta server instance -> Terracotta client
  • Terracotta server instance -> Terracotta server instance (HA setup only)
  • Terracotta client -> Terracotta server instance

Property category is indicated by the prefix:

  • l2.healthcheck.l1 indicates L2 -> L1
  • l2.healthcheck.l2 indicates L2 -> L2
  • l1.healthcheck.l2 indicates L1 -> L2

For example, the l2.healthcheck.l2.ping.enabled property applies to L2 -> L2.

The following properties are supported:

| Property | Definition | Default L2 to L1 | Default L2 to L2 | Default L1 to L2 |
|---|---|---|---|---|
| ping.enabled | Enables (True) or disables (False) ping tests. | True | True | True |
| ping.idletime | The time (in milliseconds) that HealthChecker waits between ping tests. | 30000 | 5000 | 5000 |
| ping.interval | If no response is received to a ping test, the time (in milliseconds) that HealthChecker waits between retries. | 10000 | 1000 | 1000 |
| ping.probes | If no response is received to a ping test, the maximum number of retries HealthChecker can attempt. | 6 | 3 | 3 |
| socketConnect | Enables (True) or disables (False) socket-connection tests. | True | True | True |
| socketConnectCount | The maximum failed socket connections HealthChecker allows before assuming that the target node is dead. | 2 | 2 | 2 |
| socketConnectTimeout | Regardless of successful socket connections, the number (integer) of consecutive ping.intervals HealthChecker allows before assuming that the target node is dead. | 2 | 10 | 10 |
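
As a minimal sketch, the L2 -> L2 HealthChecker properties might be written out in full as shown here, combining the l2.healthcheck.l2 prefix with each property name from the table. The values are the L2 -> L2 defaults listed above. The <property name="..." value="..."/> element form and the lowercase boolean spelling are assumptions not shown in this document, so confirm them against the Configuration Guide and Reference.

```xml
<!-- Sketch only: the <property> element form is assumed, not taken from this document. -->
<tc-properties>
  <!-- Ping tests between server instances. -->
  <property name="l2.healthcheck.l2.ping.enabled" value="true"/>
  <!-- Milliseconds between ping tests. -->
  <property name="l2.healthcheck.l2.ping.idletime" value="5000"/>
  <!-- Milliseconds between retries when a ping gets no response. -->
  <property name="l2.healthcheck.l2.ping.interval" value="1000"/>
  <!-- Maximum number of retries. -->
  <property name="l2.healthcheck.l2.ping.probes" value="3"/>
  <!-- Socket-connection tests. -->
  <property name="l2.healthcheck.l2.socketConnect" value="true"/>
  <!-- Failed socket connections allowed before the peer is assumed dead. -->
  <property name="l2.healthcheck.l2.socketConnectCount" value="2"/>
  <!-- Consecutive ping.intervals allowed before the peer is assumed dead. -->
  <property name="l2.healthcheck.l2.socketConnectTimeout" value="10"/>
</tc-properties>
```

The same pattern applies to the other two categories by swapping the prefix for l2.healthcheck.l1 or l1.healthcheck.l2.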

High Availability Network Architecture And Testing

In order to take advantage of the Terracotta active-passive server configuration, certain network configurations are necessary to ensure that there is no split-brain and that the L1s and the L2s will behave in a deterministic manner when a failure does occur (network, machine, etc.).

If you have turned off disk caching to prevent loss of data in case of a power outage affecting all Terracotta server instances in the cluster, performance may degrade substantially. See this troubleshooting issue for more information.

This document outlines two network configurations that are known to work with Terracotta failover. Other network configurations may also work reliably, but the configurations listed in this document have been well tested and are fully supported.

For assistance with configurations not listed in this document, please contact us directly. We can help you determine whether your specific configuration will be compatible with Terracotta.


Deployment Configuration: Simple (no network redundancy)

Description

This is the simplest network configuration. There is no network redundancy, so when any failure occurs there is a good chance that all or part of the cluster will stop functioning. All failover activity is left to the Terracotta software.

In this diagram, the IP addresses are merely examples to demonstrate that the L1s (L1a & L1b) and L2s (TCserverA & TCserverB) can live on different subnets. The actual addressing scheme is specific to your environment. There is a single switch that is a single point of failure.

Additional configuration

There is no additional network or operating system configuration necessary in this configuration. Each machine needs a proper network configuration (IP address, subnet mask, gateway, DNS, NTP, hostname) and must be plugged into the network.

Test Plan - Network Failures Non-redundant Network

To determine that your configuration is correct, use the following tests to confirm all failure scenarios behave as expected.

| TestID | Failure | Expected Outcome |
|---|---|---|
| FS1 | Loss of L1a (link or system) | Cluster continues as normal using only L1b |
| FS2 | Loss of L1b (link or system) | Cluster continues as normal using only L1a |
| FS3 | Loss of L1a & L1b | Non-functioning cluster |
| FS4 | Loss of Switch | Non-functioning cluster |
| FS5 | Loss of Active L2 (link or system) | Passive L2 becomes new Active L2, L1s fail over to new Active L2 |
| FS6 | Loss of Passive L2 | Cluster continues as normal without TC redundancy |
| FS7 | Loss of TCservers A & B | Non-functioning cluster |

Test Plan - Network Tests Non-redundant Network

After the network has been configured, you can test your configuration with simple ping tests.

| TestID | Host | Action | Expected Outcome |
|---|---|---|---|
| NT1 | all | ping every other host | successful ping |
| NT2 | all | pull network cable during continuous ping | ping failure until link restored |
| NT3 | switch | reload | all pings cease until reload complete and links restored |

Deployment Configuration: Fully Redundant

Description

This is the fully redundant network configuration. It relies on the fail over capabilities of Terracotta, the switches, and the operating system. In this scenario it is even possible to sustain certain double failures and still maintain a fully functioning cluster.

In this diagram, the IP addressing scheme is merely to demonstrate that the L1s (L1a & L1b) can be on a different subnet than the L2s (TCserverA & TCserverB). The actual addressing scheme will be specific to your environment. If you choose to implement with a single subnet, then there will be no need for VRRP/HSRP but you will still need to configure a single VLAN (can be VLAN 1) for all TC cluster machines.

In this diagram, there are two switches that are connected with trunked links for redundancy and which implement Virtual Router Redundancy Protocol (VRRP) or HSRP to provide redundant network paths to the cluster servers in the event of a switch failure. Additionally, all servers are configured with both a primary and a secondary network link, which are controlled by the operating system. In the event of a NIC or link failure on any single link, the operating system should fail over to the backup link without disturbing (e.g. restarting) the Java processes (L1 or L2) on the systems.

The Terracotta failover is identical to that in the simple case above. However, both NICs on a single host would need to fail in this scenario before the TC software initiates any failover of its own.

Additional configuration

  • Switch - Switches need to implement VRRP or HSRP to provide redundant gateways for each subnet. Switches also need to have a trunked connection of two or more lines in order to prevent any single link failure from splitting the virtual router in two.
  • Operating System - Hosts need to be configured with bonded network interfaces connected to the two different switches. For Linux, choose mode 1. More information about Linux channel bonding can be found in the RedHat Linux Reference Guide. Pay special attention to the amount of time it takes for your VRRP or HSRP implementation to reconverge after a recovery. You don't want your NICs to change to a switch that is not ready to pass traffic. This should be tunable in your bonding configuration.

Test Plan - Network Failures Redundant Network

The following tests continue the tests listed in Network Failures (Pt. 1). Use these tests to confirm that your network is configured properly.

| TestID | Failure | Expected Outcome |
|---|---|---|
| FS8 | Loss of any primary network link | Failover to standby link |
| FS9 | Loss of all primary links | All nodes fail to their secondary link |
| FS10 | Loss of any switch | Remaining switch assumes VRRP address and switches fail over NICs if necessary |
| FS11 | Loss of any L1 (both links or system) | Cluster continues as normal using only other L1 |
| FS12 | Loss of Active L2 | Passive L2 becomes the new Active L2, all L1s fail over to the new Active L2 |
| FS13 | Loss of Passive L2 | Cluster continues as normal without TC redundancy |
| FS14 | Loss of both switches | Non-functioning cluster |
| FS15 | Loss of single link in switch trunk | Cluster continues as normal without trunk redundancy |
| FS16 | Loss of both trunk links | Possible non-functioning cluster depending on VRRP or HSRP implementation |
| FS17 | Loss of both L1s | Non-functioning cluster |
| FS18 | Loss of both L2s | Non-functioning cluster |

Test Plan - Network Testing Redundant Network

After the network has been configured, you can test your configuration with simple ping tests and various failure scenarios.

The test plan for Network Testing consists of the following tests:

| TestID | Host | Action | Expected Outcome |
|---|---|---|---|
| NT4 | any | ping every other host | successful ping |
| NT5 | any | pull primary link during continuous ping to any other host | failover to secondary link, no noticeable network interruption |
| NT6 | any | pull standby link during continuous ping to any other host | no effect |
| NT7 | Active L2 | pull both network links | Passive L2 becomes Active, L1s fail over to new Active L2 |
| NT8 | Passive L2 | pull both network links | no effect |
| NT9 | switchA | reload | nodes detect link down and fail to standby link, brief network outage if VRRP transition occurs |
| NT10 | switchB | reload | brief network outage if VRRP transition occurs |
| NT11 | switch | pull single trunk link | no effect |

Cluster Tests with Terracotta

All tests in this section should be run after the Network Tests succeed.

Test Plan - Active L2 System Loss Tests - verify Passive Takeover

The test plan for Passive takeover consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TAL1 | Active L2 Loss - Kill | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. kill -9 Terracotta PID on L2-A (Active) | L2-B (passive) becomes active. Takes the load. No drop in TPS on failover. |
| TAL2 | Active L2 Loss - Clean Shutdown | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Run ~/bin/stop-tc-server.sh on L2-A (Active) | L2-B (passive) becomes active. Takes the load. No drop in TPS on failover. |
| TAL3 | Active L2 Loss - Power Down | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Power down L2-A (Active) | L2-B (passive) becomes active. Takes the load. No drop in TPS on failover. |
| TAL4 | Active L2 Loss - Reboot | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Reboot L2-A (Active) | L2-B (passive) becomes active. Takes the load. No drop in TPS on failover. |
| TAL5 | Active L2 Loss - Pull Plug | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Pull the power cable on L2-A (Active) | L2-B (passive) becomes active. Takes the load. No drop in TPS on failover. |

Test Plan - Passive L2 System Loss Tests

System loss tests confirm High Availability in the event of loss of a single system. This section outlines tests for failure of the passive Terracotta server.

The test plan for passive Terracotta server failures consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TPL1 | Passive L2 Loss - Kill | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. kill -9 L2-B (Passive) | Data directory needs to be cleaned up; then, when L2-B is restarted, it re-synchs state from the Active server. |
| TPL2 | Passive L2 Loss - Clean Shutdown | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Run ~/bin/stop-tc-server.sh on L2-B (Passive) | Data directory needs to be cleaned up; then, when L2-B is restarted, it re-synchs state from the Active server. |
| TPL3 | Passive L2 Loss - Power Down | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Power down L2-B (Passive) | Data directory needs to be cleaned up; then, when L2-B is restarted, it re-synchs state from the Active server. |
| TPL4 | Passive L2 Loss - Reboot | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Reboot L2-B (Passive) | Data directory needs to be cleaned up; then, when L2-B is restarted, it re-synchs state from the Active server. |
| TPL5 | Passive L2 Loss - Pull Plug | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app 2. Pull plug on L2-B (Passive) | Data directory needs to be cleaned up; then, when L2-B is restarted, it re-synchs state from the Active server. |

Test Plan - Failover/Failback Tests

This section outlines tests to confirm the cluster's ability to fail over to the passive Terracotta server and to fail back.

The test plan for testing fail over and fail back consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TFO1 | Failover/Failback | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application 2. kill -9 (or run stop-tc-server) on L2-A (Active) 3. After L2-B takes over as Active, run start-tc-server on L2-A (L2-A is now passive) 4. kill -9 (or run stop-tc-server) on L2-B (L2-A is now Active) | After the first failover (L2-A -> L2-B), txns should continue. L2-A should come up cleanly in passive mode when tc-server is run. When the second failover occurs (L2-B -> L2-A), L2-A should process txns. |

Test Plan - Loss of Switch Tests

This test can only be run on a redundant network.

This section outlines testing the loss of a switch in a redundant network, and confirming that no interruption of service occurs.

The test plan for testing failure of a single switch consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TSL1 | Loss of 1 Switch | 2 switches in redundant configuration. L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application 2. Power down/pull plug on Switch | All traffic transparently moves to switch 2 with no interruptions |

Test Plan - Loss of Network Connectivity

This section outlines testing the loss of network connectivity.

The test plan for testing failure of the network consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TNL1 | Loss of NIC wiring (Active) | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application 2. Remove network cable on L2-A | All traffic transparently moves to L2-B with no interruptions |
| TNL2 | Loss of NIC wiring (Passive) | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application 2. Remove network cable on L2-B | No user impact on cluster |

Test Plan - Terracotta Cluster Failure

This section outlines the tests to confirm successful continued operations in the face of Terracotta cluster failures.

The test plan for testing Terracotta Cluster failures consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TF1 | Process Failure Recovery | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application 2. Bring down all L1s and L2s 3. Start L2s, then L1s | Cluster should come up and begin taking txns again |
| TF2 | Server Failure Recovery | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application 2. Power down all machines 3. Start L2s, then L1s | Should be able to run application once all servers are up |

Client Failure Tests

This section outlines tests to confirm successful continued operations in the face of Terracotta client failures.

The test plan for testing Terracotta Client failures consists of the following tests:

| TestID | Test | Setup | Steps | Expected Result |
|---|---|---|---|---|
| TCF1 | L1 Failure | L2-A is active, L2-B is passive. 2 L1s (L1-A and L1-B). All systems are running and available to take traffic. | 1. Run application 2. kill -9 L1-A | L1-B should take all incoming traffic. Some timeouts may occur due to txns in process when L1 fails over. |