This document shows you how to architect a Terracotta server array to add cluster reliability, availability, and scalability.
Terracotta server arrays can vary from a basic two-node tandem to a multi-node array providing configurable scale, high performance, and deep failover coverage.
The main features of a Terracotta server array include:
The major components of a Terracotta server array are:
Figure 1 illustrates a Terracotta cluster with three mirror groups. Each mirror group has an active server and a standby, and manages one third of the shared data in the cluster.

A Terracotta cluster has the following functional characteristics:
There can never be more than one active server instance per mirror group, but there can be any number of standbys. In Fig. 1, Mirror Group 1 could have two standbys, while Mirror Group 3 could have four standbys. However, a performance overhead may become evident when adding more standby servers due to the load placed on the active server by having to synchronize with each standby.
The number of partitions equals the number of mirror groups. In Fig. 1, each mirror group has one third of the shared data in the cluster.
Failover is provided within each mirror group, not across mirror groups. This is because mirror groups provide scale by managing discrete portions of the shared data in the cluster -- they do not replicate each other. In Fig. 1, if Mirror Group 1 goes down, the cluster must pause (stop work) until Mirror Group 1 is back up with its portion of the shared data intact.
No additional configuration is required to coordinate active server instances.
In Fig. 1, the L2 PASSIVE (standby) servers can be shut down and replaced with no affect on cluster functions. However, to add or remove an entire mirror group, the cluster must be brought down. Note also that in this case the original Terracotta configuration file is still in effect and no new servers can be added. Replaced standby servers must have the same address (hostname or IP address).
To successfully configure a Terracotta server array using the Terracotta configuration file, note the following:
<ha>
<mode>networked-active-passive</mode>
<networked-active-passive>
<election-time>5</election-time>
</networked-active-passive>
</ha>
Be default, the active-passive mode used is "disk-based-active-passive", which is not recommended except for demonstration purposes or where a networked connection is not feasible. See the cluster architecture section in the Terracotta Concept and Architecture Guide for more information on disk-based active-passive mode.
"permanent-store" means that application state, or shared in-memory data, is backed up to disk. In case of failure, it is automatically restored. Shared data is removed from disk once it no longer exists in any client's memory.
For more information on Terracotta configuration files, see:
Certain versions of Terracotta provide tools to create backups of the Terracotta server array's disk store. See the Terracotta Operations Center and the Database Backup Utility (backup-data) for more information.
Any Terracotta server array handles perceived client disconnection (for example a network failure, a long client GC, or node failure) based on the configuration of the HealthChecker or Automatic Client Reconnect mechanisms. A disconnected client also attempts to reconnect based on these mechanisms. The client tries to reconnect first to the initial server, then to any other servers set up in its Terracotta configuration. To preserve data integrity, clients resend any transactions for which they have not received server acks.
For more information on client behavior, see Cluster Structure and Behavior.
The Terracotta cluster can be configured into a number of different setups to serve both deployment stage and production needs. Note that in multi-setup setups, failover characteristics are affected by HA settings (see Configuring Terracotta For High Availability).
In a development environment, persisting shared data is often unnecessary and even inconvenient. It puts more load on the server, while accumulated data can fill up disks or prevent automatic restarts of servers, requiring manual intervention. Running a single-server Terracotta cluster without persistence is a good solution for creating a more efficient development environment.

By default, a single Terracotta server is in "temporary-swap-mode", which means it lacks persistence. Its configuration could look like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
<servers>
<server name="Server1">
<data>/opt/terracotta/server1-data</data>
<l2-group-port>9530</l2-group-port>
</server>
<servers>
...
</tc:tc-config>
The "unreliable" configuration in Terracotta Cluster in Development may be advantageous in development, but if shared in-memory data must be persisted, the server's configuration must be expanded:
<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
<servers>
<server name="Server1">
<data>/opt/terracotta/server1-data</data>
<l2-group-port>9530</l2-group-port>
<dso>
<!-- The persistence mode is "temporary-swap-only" by default, so it must be changed explicitly. -->
<persistence>
<mode>permanent-store</mode>
</persistence>
</dso>
</server>
</servers>
...
</tc:tc-config>
The value of the <persistence> element's <mode> subelement is "temporary-swap-only" by default. By changing it to "permanent-store", the server now backs up all shared in-memory data to disk.

If the server is restarted, application state (all clustered data) in the shared heap is restored.
In addition, previously connected clients are allowed to rejoin the cluster within a window set by the <client-reconnect-window> element:
<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
<servers>
<server name="Server1">
<data>/opt/terracotta/server1-data</data>
<l2-group-port>9530</l2-group-port>
<dso>
<!-- By default the window is 120 seconds. -->
<client-reconnect-window>120</client-reconnect-window>
<!-- The persistence mode is "temporary-swap-only" by default, so it must be changed explicitly. -->
<persistence>
<mode>permanent-store</mode>
</persistence>
</dso>
</server>
</servers>
...
</tc:tc-config>
The <client-reconnect-window> does not have to be explicitly set if the default value is acceptable. However, in a single-server cluster <client-reconnect-window> is in effect only if persistence mode is set to "permanent-store".
The example illustrated in Fig. 3 presents a reliable but not highly available cluster. If the server fails, the cluster fails. There is no redundancy to provide failover. Adding a standby server adds availability because the standby failover (see Fig. 4).

In this array, if the active Terracotta server instance fails then the standby instantly takes over and the cluster continues functioning. No data is lost.
The following Terracotta configuration file demonstrates how to configure this two-server array:
<?xml version="1.0" encoding="UTF-8" ?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-5.xsd">
<servers>
<server name="Server1">
<data>/opt/terracotta/server1-data</data>
<l2-group-port>9530</l2-group-port>
<dso>
<persistence>
<mode>permanent-store</mode>
</persistence>
</dso>
</server>
<server name="Server2">
<data>/opt/terracotta/server2-data</data>
<l2-group-port>9530</l2-group-port>
<dso>
<persistence>
<mode>permanent-store</mode>
</persistence>
</dso>
</server>
<ha>
<mode>networked-active-passive</mode>
<networked-active-passive>
<election-time>5</election-time>
</networked-active-passive>
</ha>
</servers>
...
</tc:tc-config>
The recommended <mode> in the <ha> section is "networked-active-passive" because it allows the active and passive servers to synchronize directly, without relying on a disk.
You can add more standby servers to this configuration by adding more <server> sections. However, a performance overhead may become evident when adding more standby servers due to the load placed on the active server by having to synchronize with each standby.
How server instances behave at startup depends on when in the life of the cluster they are started.
In a single-server configuration, when the server is started it performs a startup routine and then is ready to run the cluster (ACTIVE status). If multiple server instances are started at the same time, one is elected the active server (ACTIVE-COORDINATOR status) while the others serve as standbys (PASSIVE-STANDBY status). The election is recorded in the servers’ logs.
If a server instance is started while an active server instance is present, it syncs up state from the active server instance before becoming a standby. The active and passive servers must always be synchronized, allowing the passive to mirror the state of the active. The standby server goes through the following states:
The active server instance carries the load of sending state to the standby during the synchronization process. The time taken to synchronized is dependent on the amount of clustered data and on the current load on the cluster. The active server instance and standbys should be run on similarly configured machines for better throughput, and should be started together to avoid unnecessary sync ups.
If the active fails and more than one standby is present, an election determines the new active. Successful failover takes place only if at least one standby server is fully synchronized with the active server. Shutting down the active server before a fully-synchronized standby is available can result in a cluster-wide failure.
Terracotta server instances acting as standbys can run either in persistent mode or non-persistent mode. If an active server instance running in persistent mode goes down, and a standby takes over, the crashed server’s data directory must be cleared before it can be restarted and allowed to rejoin the cluster. Removing the data is necessary because the cluster state could have changed since the crash. During startup, the restarted server’s new state is synchronized from the new active server instance. A crashed standby running in persistent mode, however, automatically recovers by wiping its own database.
If both servers are down, and clustered data is persisted, the last server to be active should be started first to avoid errors and data loss. Check the server logs to determine which server was last active.
To safely migrate clients to a standby server without stopping the cluster, follow these steps:
The standby server must already be configured in the Terracotta configuration file.
Clients should connect to the new active server.
The previously active server can now rejoin the cluster as a standby server.
For a cluster with persistence, a safe cluster shutdown should follow these steps:
The Terracotta client will shut down when you shut down your application.
To restart the cluster, first start the server that was last active. If clustered data is not persisted, either server could be started first as no database conflicts can take place.
In a Terracotta cluster, "split brain" refers to a scenario where two servers assume the role of active server (ACTIVE-COORDINATOR status). This can occur during a network problem that disconnects the active and standby servers, causing the standby to both become an active server and open a reconnection window for clients (<client-reconnect-window>).
If the connection between the two servers is never restored, then two independent clusters are in operation. This is not a split-brain situation. However, if the connection is restored, one of the following scenarios results:
The cluster can solve almost all split-brain occurrences without loss or corruption of shared data. However, it is highly recommended that after such an occurrence the integrity of shared data be confirmed.
For capacity requirements that exceed the capabilities of an two-server active-passive setup, expand the Terracotta server array using a mirror-groups configuration. Using mirror groups with multiple coordinated active Terracotta server instances adds scalability to server array.
Scalable server arrays are available in enterprise versions of Terracotta.

Mirror groups are specified in the <servers> section of the Terracotta configuration file. Mirror groups work by assigning group memberships to Terracotta server instances. The following snippet from a Terracotta configuration file shows a mirror-group configuration with four servers:
...
<servers>
<server name="server1">
...
</server>
<server name="server2">
...
</server>
<server name="server3">
...
</server>
<server name="server4">
...
</server>
<mirror-groups>
<mirror-group group-name="groupA">
<members>
<member>server1</member>
<member>server2</member>
</members>
</mirror-group>
<mirror-group group-name="groupB">
<members>
<member>server3</member>
<member>server4</member>
</members>
</mirror-group>
</mirror-groups>
<ha>
<mode>networked-active-passive</mode>
<networked-active-passive>
<election-time>5</election-time>
</networked-active-passive>
</ha>
</servers>
...
In this example, the cluster is configured to have two active servers, each with its own standby. If server1 is elected active in groupA, server2 becomes its standby. If server3 is elected active in groupB, server4 becomes its standby. server1 and server2 automatically coordinate their work managing Terracotta clients and shared data across the cluster.
In a Terracotta server array designed for multiple active Terracotta server instance, the server instances in each mirror group participate in an election to choose the active. Once every mirror group has elected an active server instance, all the active server instances in the cluster begin cooperatively managing the cluster. The rest of the server instances become standbys for the active server instance in their mirror group. If the active in a mirror group fails, a new election takes place to determine that mirror group's new active. Clients continue work without regard to the failure.
In a Terracotta cluster with mirror groups, each group, or "stripe," behaves in a similar way to an active-passive setup (see Terracotta Server Array with High Availability). For example, when a server instance is started in a stripe while an active server instance is present, it synchronizes state from the active server instance before becoming a standby. A standby cannot become an active server instance during a failure until it is fully synchronized. If an active server instance running in persistent mode goes down, and a standby takes over, the data directory must be cleared before bringing back the crashed server.
If the active server in a mirror group fails or is taken down, the cluster stops until a standby takes over and becomes active (ACTIVE-COORDINATOR status).
However, the cluster cannot survive the loss of an entire stripe. If an entire stripe fails and no server in the failed mirror-group becomes active within the allowed window (based on HA settings; see Configuring Terracotta For High Availability), the entire cluster must be restarted.
High-availability configuration can be set per mirror group. The following snippet from a Terracotta configuration file shows a mirror-group configured with its own high-availability section:
...
<servers>
<server name="server1">
...
</server>
<server name="server2">
...
</server>
<server name="server3">
...
</server>
<server name="server4">
...
</server>
<mirror-groups>
<mirror-group group-name="groupA">
<members>
<member>server1</member>
<member>server2</member>
</members>
<ha>
<mode>networked-active-passive</mode>
<networked-active-passive>
<election-time>10</election-time>
</networked-active-passive>
</ha>
</mirror-group>
<mirror-group group-name="groupB">
<members>
<member>server3</member>
<member>server4</member>
</members>
</mirror-group>
</mirror-groups>
<ha>
<mode>networked-active-passive</mode>
<networked-active-passive>
<election-time>5</election-time>
</networked-active-passive>
</ha>
</servers>
...
In this example, the servers in groupA can take up to 10 seconds to elect an active server. The servers in groupB take their election time from the <ha> section outside the <mirror-groups> block, which is in force for all mirror groups that do not have their own <ha> section.
See the mirror-groups section in the Configuration Guide and Reference for more information on mirror-groups configuration elements.