Goals
Currently, for high availability, L2s can run in active-passive mode, but they need shared storage that is accessible to both the active and passive nodes. The disk share is used to share data between the active and passive servers. It is also used to elect an active L2 server using file locks.
While this setup works well, it requires users of Terracotta DSO to set up a shared storage system (SAN-based, NFS-based, etc.) to get high availability.
The goal of this feature is to remove that requirement and still provide highly available, redundant TC servers without sacrificing too much performance.
(Jira ticket: https://jira.terracotta.org/jira/browse/CDV-38)
Requirements
Must haves
- Users should be able to start L2s in active-passive mode without having a commonly visible disk share.
- Users should be able to run in both persistence and non-persistence mode.
- L2s should automatically elect an active on startup and on failover.
- Users should be able to start multiple passive servers. Since this affects performance, it is good to limit this to 3 or 4.
- Users should be able to bring passive servers up or down at any time. However, a passive server that is brought up after a lot of shared objects have been created must sync all that data from the active server before it can become one of the passive servers available for switch-over. The time it takes to become available depends on the amount of data and the load on the cluster.
- On failover, all connected L1s should automatically connect to the current active server.
- Clear messages should be printed to the console to tell the operator when a passive joins, when it becomes available, and when a switch-over happens.
Optional
- Users should be able to force a switch-over to a particular passive server for maintenance reasons.
- Performance in active-passive mode without a disk share should degrade by only 10-20% compared to active-passive mode with a disk share.
- Detect and avoid the split-brain problem as much as possible.
- The admin console should be able to handle the switch-over seamlessly.
High Level Design
- DSO L2 startup sequence and comms code will need to change to handle multiple servers starting up and connecting to each other. The config system need not change.
- An election policy will be used to elect an active server from the list of servers.
- The Lock Manager will reside only on the active server.
- As transactions come from the L1s to the active server, they will be passed to all the passive servers to process.
- An ack is sent back to the originating client only when all passive servers have applied and acked the transaction.
- By using the same transactions arriving from the clients, and the same SEDA queues for processing them in the passive, we get the benefit of all the performance enhancements made in the past.
- When a passive server joins a cluster, it initially looks up all the objects from the active L2 in phases. While this is going on, only transactions containing changes to objects that are already available to the passive L2 are sent across.
- Whenever a passive starts up, it always gets the current state from the active and discards any old state on disk.
- Separate SEDA stages will be created to parallelize these activities.
- When a fail-over happens, one of the available passives will be elected by the cluster to become active.
- L1s connect to the new active server and retransmit the state (lock, resend transactions and object account)
- On distributed gc, the list of garbage should be sent to all the passive servers to delete from the objectmanager.
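The ack rule in the design above (the client is acked only after every passive has applied and acked the transaction) can be sketched as a small bookkeeping class on the active server. This is only an illustrative sketch, not the actual implementation: the class name PassiveAckTracker, its methods, and the use of string node IDs are all hypothetical. The handling of a passive that leaves mid-transaction is also an assumption, since the design does not spell it out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: tracks which passives still owe an ack for each transaction. */
final class PassiveAckTracker {
    // txnId -> IDs of passives that have not yet acked that transaction
    private final Map<Long, Set<String>> pending = new HashMap<>();

    /**
     * Record that a transaction was forwarded to the given passives.
     * Returns true if there is nothing to wait for (no passives), so the
     * client can be acked immediately.
     */
    synchronized boolean transactionSent(long txnId, Set<String> passives) {
        if (passives.isEmpty()) {
            return true;
        }
        pending.put(txnId, new HashSet<>(passives));
        return false;
    }

    /**
     * Record an ack from one passive. Returns true only when this was the
     * last outstanding ack, i.e. the originating client may now be acked.
     */
    synchronized boolean ackReceived(long txnId, String passive) {
        Set<String> waiting = pending.get(txnId);
        if (waiting == null) {
            return false; // unknown or already completed transaction
        }
        waiting.remove(passive);
        if (waiting.isEmpty()) {
            pending.remove(txnId);
            return true;
        }
        return false;
    }

    /**
     * Assumption: when a passive leaves the cluster we stop waiting for it.
     * Returns the transactions that became fully acked as a result.
     */
    synchronized Set<Long> passiveLeft(String passive) {
        Set<Long> completed = new HashSet<>();
        Iterator<Map.Entry<Long, Set<String>>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Set<String>> entry = it.next();
            Set<String> waiting = entry.getValue();
            if (waiting.remove(passive) && waiting.isEmpty()) {
                completed.add(entry.getKey());
                it.remove();
            }
        }
        return completed;
    }
}
```

In this sketch the active would call transactionSent when it forwards a transaction, and ack the originating L1 as soon as ackReceived or passiveLeft reports the transaction as complete.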
Future improvements
- On startup, passive servers could intelligently sync with the active server only those objects that changed while they were down.
- L1s could directly send transactions to active and passives to avoid the network latency.
- GC could be run on one of the passive servers when possible, to take some load off the active server.
High Level Tasks And Status
- Comms Multicast/Unicast manager
- Election Algorithm
- Cluster ID
- Stages for txn propagation
- Stages for Object lookup/retrieval
- Stages for Object Delete notification
TODOs:
- Change config to add interface + port for server-server communication.
- Enhance TC Comms to do group communication.
- Cluster wide cluster ID
- Global Transaction ID needs to be sent across.
- Persistence mode: when a crashed active comes back up, it can become active again only if the cluster did not move ahead after the crash. Otherwise it should truncate its db.
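The persistence-mode rule in the last TODO can be sketched as a single decision made at restart. This is a rough sketch under stated assumptions: it assumes that "the cluster moved ahead" can be detected by comparing the last global transaction ID in the restarted server's db with the highest one known to the rest of the cluster. The names RestartPolicy, Action, and onRestart are hypothetical and do not come from the codebase.

```java
/** Hypothetical sketch of the restart decision for a server running in persistence mode. */
final class RestartPolicy {

    enum Action {
        /** Local db is as current as the cluster; the server may take part in the election. */
        MAY_BECOME_ACTIVE,
        /** Cluster processed transactions after the crash; discard local db and resync. */
        TRUNCATE_AND_REJOIN_AS_PASSIVE
    }

    /**
     * @param localLastGid   last global transaction ID found in the local db (assumed available)
     * @param clusterLastGid highest global transaction ID reported by the other servers (assumed available)
     */
    static Action onRestart(long localLastGid, long clusterLastGid) {
        // If the cluster has seen a newer transaction than the local db, the local state is stale.
        if (clusterLastGid > localLastGid) {
            return Action.TRUNCATE_AND_REJOIN_AS_PASSIVE;
        }
        return Action.MAY_BECOME_ACTIVE;
    }
}
```

For example, onRestart(100, 100) allows the server to become active again, while onRestart(100, 250) tells it to truncate its db and rejoin as a passive, which matches the design rule that a starting passive always takes its state from the active.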
Repository
Currently this work is done in https://svn.terracotta.org/repo/tc/dso/branches/active-passive-diskless