This site hosts historical documentation. Visit www.terracotta.org for recent product information.
This page discusses "grey outages" (degraded characteristics) as well as "black and white" failures.
L2 Active Log = WARN tc.operator.event - NODE : Server1 Subsystem: CLUSTER_TOPOLOGY Message: Node ClientID[0] left the cluster
When = immediately upon the loss of the PID
Limit (with default values) = 0 seconds
With reconnect properties enabled:
L2 Active Log = (same)
When = after [l2.l1reconnect.timeout.millis] from the loss of the PID
Limit (with default values) = 15 secs (L2-L1 reconnect)
Recycle the client JVM.
L2 Active Log = WARN tc.operator.event - NODE : Server1 Subsystem: HA Message: Node ClientID[0] left the cluster
When = after ping.idletime + (ping.interval * ping.probes) + ping.interval
Limit (with default values) = 4 - 9 seconds
The limit is a measure of the time within which the process detects the failure. Why is it a limit and not an absolute value?
Because when the system encountered the problem, it could have been in one of the two states below.
State 1: All the components were in continuous conversation, so Health Monitoring has to factor in the full ping.idletime before detection of the problem begins.
State 2: All the components were connected to each other, but the application load or communication pattern was such that there was a communication silence > ping.idletime. This means that the system was already doing Health Monitoring in the background, and the cluster was detected as healthy at all times before this new problem arrived.
Therefore, it is possible that you may see the detection time as an interval within this limit.
All the expressions from here on show the maximum time detection can take, inclusive of the limiting ping.idletime. To get the lower bound of the interval, just deduct ping.idletime from the equations.
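The two-state reasoning above can be sketched as a small calculation. This is illustrative only: the default values (ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3) are back-derived from the "4 - 9 seconds" limit quoted on this page, and the parameter names are shorthand rather than actual tc.properties keys.

```python
# Illustrative sketch of the HealthChecker detection window.
# Assumed defaults (back-derived from the "4 - 9 seconds" limit above):
#   ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3

def detection_window(idletime=5, interval=1, probes=3):
    """Return the (lower, upper) bound of failure-detection time in seconds.

    Upper bound (State 1): the components were in continuous conversation,
    so the full ping.idletime elapses before probing even starts.
    Lower bound (State 2): silence had already exceeded ping.idletime, so
    probing was already in progress; deduct idletime from the equation.
    """
    upper = idletime + (interval * probes) + interval
    return upper - idletime, upper

print(detection_window())  # (4, 9) with the assumed defaults
```

Any observed detection time should fall somewhere inside this window, depending on which state the cluster was in when the problem arrived.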
With reconnect properties enabled:
L2 Active Log = (same)
When = after [l2.l1reconnect.timeout.millis] from the loss of the PID
Limit (with default values) = 15 secs (L2-L1 reconnect)
Start the client JVM after the machine reboot. (On restart, the client rejoins the cluster.)
Terracotta code should execute without any impact, except that nothing will be logged to the log files.
Application threads may or may not be able to proceed, as their ability to write to disk (e.g. logging) will be hampered.
As soon as disk usage falls back to normal.
Clean up the local disk to resume Terracotta client logging.
Slowdown in TPS at the admin console, because the L1 will not be able to release resources (e.g. locks) quickly, and the Terracotta Server Array (L2) will take more time to commit the transactions that are to be applied on this L1.
TPS recovers when CPU returns to normal. Run tests with different intervals of high CPU usage (15s, 30s, 60s, 120s, 300s).
As soon as CPU usage returns to normal.
Analyze the root cause and remedy.
Slowdown/zero TPS at the admin console, because any resource (e.g. locks) held by the L1 will not be released until the GC is over, and the Terracotta server (L2) will not be able to commit the transactions that are to be applied on this L1.
Case 1: Full GC cycle less than 45 secs. No message in L1/L2 logs. The admin console reflects normal TPS once the L1 recovers from GC.
Case 2: Full GC cycle > 45 secs
After 45 secs, L2 health monitoring declares L1 dead and prints this message in L2 logs 'INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl.TCGroupManager - L1:PORT is DEAD'
After 45 secs, primary L2 ejects the L1 from cluster and prints 'shutdownClient() : Removing txns from DB :' in L2 logs.
Once the L1 is ejected, the admin console does not show the failed L1 in the client list.
If the L1 recovers after 45 secs and tries to reconnect, the L2 refuses all connections and prints this message in L2 logs:
INFO com.tc.net.protocol.transport.ServerStackProvider - Client Cannot Reconnect ConnectionID(
L2 Active Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl.TCGroupManager - 127.0.0.1:56735 might be in Long GC. Ping-probe cycles completed since last reply : 1 .... ....
INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - localhost:56735 is DEAD
When = Detection in ping.idletime + (ping.interval * ping.probes) + ping.interval
Disconnection in ping.idletime + socketConnectCount * [(ping.interval * ping.probes) + ping.interval]
Limit (with default values) = detection in 4 - 9 seconds, disconnection in 45 seconds
Max allowed GC time at L1 = 'L2-L1 health monitoring' = 45 secs.
Analyze GC issues and remedy.
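The 45-second disconnection limit above follows from the equation directly. The sketch below is illustrative only: the defaults (ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3, socketConnectCount = 10) are back-derived from the limits quoted on this page, not confirmed tc.properties values.

```python
# Illustrative sketch of the L2-L1 disconnection time for a long client GC.
# During a GC the socket connect still succeeds, so the probe cycle is
# retried socketConnectCount times before the L1 is declared DEAD.
# Assumed defaults (back-derived from the 45-second limit above):
#   ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3,
#   socketConnectCount = 10

def gc_disconnection_time(idletime=5, interval=1, probes=3,
                          socket_connect_count=10):
    """ping.idletime + socketConnectCount * [(ping.interval * ping.probes) + ping.interval]"""
    return idletime + socket_connect_count * ((interval * probes) + interval)

print(gc_disconnection_time())  # 45 seconds with the assumed defaults
```

Any L1 whose full GC pause stays under this value is never ejected, which is why the page states 45 secs as the max allowed GC time.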
Slowdown/zero TPS at the admin console, because any resource (e.g. locks) held by the L1 will not be released, and the Terracotta server (L2) will not be able to commit the transactions that are to be applied on this L1.
Case 1: Client host fails over to the standby NIC within 14 seconds. No message in L1/L2 logs. TPS resumes to normal at the admin console as soon as the L1 NIC is restored.
Case 2: Client host fails over to standby NIC after 14 seconds -
After 14 secs, Terracotta Server Array health monitoring declares L1 dead and prints this message in L2 logs 'INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - L1 IP:PORT is DEAD'
After 14 seconds, the primary L2 ejects the L1 from the cluster and prints 'shutdownClient() : Removing txns from DB :' in its logs.
Once the L1 is ejected, the admin console does not show the failed L1 in the client list.
If the L1 recovers after 14 secs and tries to reconnect, the L2 doesn't allow it to reconnect and prints this message in L2 logs:
INFO com.tc.net.protocol.transport.ServerStackProvider - Client Cannot Reconnect ConnectionID(
L1 Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Socket Connect to indev1.terracotta.lan:9510(callbackport:9510) taking long time. probably not reachable.
[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - indev1.terracotta.lan:9510 is DEAD
L2 Active Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - Socket Connect to pbhardwa.terracotta.lan:52275(callbackport:52274) taking long time. probably not reachable.
[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - pbhardwa.terracotta.lan:52275 is DEAD
When = Detection in ping.idletime + (ping.interval * ping.probes) + ping.interval
Disconnection in ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval
Limit (with default values) = detection in 4 - 9 seconds, disconnection in 14 seconds
Max allowed NIC recovery time at L1 = 'L2-L1 health monitoring' = 14 secs.
No action needed immediately. At some point fix failed NIC.
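Unlike a long GC, a dead NIC also fails the socket-connect check, so only one extended probe cycle runs before the L1 is declared dead. The sketch below is illustrative only; the defaults (ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3, socketConnectTimeout = 5, expressed in ping.interval units) are back-derived from the 14-second limit quoted above, not confirmed tc.properties values.

```python
# Illustrative sketch of the L2-L1 disconnection time for an L1 NIC failure.
# Assumed defaults (back-derived from the 14-second limit above):
#   ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3,
#   socketConnectTimeout = 5 (multiplier of ping.interval)

def nic_disconnection_time(idletime=5, interval=1, probes=3,
                           socket_connect_timeout=5):
    """ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval"""
    return idletime + (interval * probes + socket_connect_timeout * interval) + interval

print(nic_disconnection_time())  # 14 seconds with the assumed defaults
```

The single socketConnectTimeout term, versus the socketConnectCount multiplier in the GC case, is what makes NIC failure detection (14 secs) so much faster than long-GC ejection (45 secs).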
Terracotta code should execute without any impact, except that nothing will be logged to the log files.
Application threads may not be able to proceed, as their ability to write to disk (e.g. logging) will be hampered.
If the switch fails such that the primary L2 (of a mirror group) is unreachable from the hot-standby L2 of the same mirror group and from all L1s.
If the switch fails such that primary and hot-standby L2 connectivity is intact, while the L1s' connectivity with the primary L2 is broken:
- Zero TPS at admin console
- Max allowed recovery time from switch failure = L2-L2 health monitoring = 14 secs.
- If failover to the redundant switch occurs within 14 secs, the cluster topology remains untouched and TPS resumes to normal at the admin console after switch recovery.
- If the switch does not fail over within 14 secs, the L2 quarantines all the L1s from the cluster. After switch recovery, all the L1s have to be restarted to make them rejoin the cluster.
No action needed immediately. Restore the switch at a later point.
Expected behavior - Zero TPS at admin console
Case 1: TC server host fails over to standby NIC within 14 seconds - TPS resumes on admin console as soon as NIC recovery happens.
Case 2: TC server host does not fail over to standby NIC within 14 seconds:
- After 14 secs, all L1s disconnect from the primary L2 and try connecting to the hot-standby L2.
- After 14 secs, the hot-standby L2 starts an election to become primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
- After 19 secs, the hot-standby L2 becomes the primary L2 and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
- Once the hot-standby L2 becomes the primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied' and TPS resumes at the admin console.
- Once the old primary L2 recovers, it is zapped by the new primary L2.
L1 Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Socket Connect to indev1.terracotta.lan:9510(callbackport:9510) taking long time. probably not reachable.
[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - indev1.terracotta.lan:9510 is DEAD
L2 Passive Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - Socket Connect to pbhardwa.terracotta.lan:52275(callbackport:52274) taking long time. probably not reachable.
[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - pbhardwa.terracotta.lan:52275 is DEAD
When = Detection in ping.idletime + (ping.interval * ping.probes) + ping.interval
Disconnection in ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval
Passive becomes active in ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval + Election Time
Limit (with default values) = detection in 4 - 9 seconds, disconnection in 14 seconds, passive takes over in 19 seconds
Max allowed recovery time = min(L2-L1 health monitoring (14 secs), L1-L2 health monitoring (14 secs), L2-L2 health monitoring (14 secs)) = 14 secs.
The complete recovery time will be more than 19 secs; the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 25 seconds.
No action needed immediately. At some point, fix the failed NIC after forcing a failover to the standby Terracotta Server.
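The 19-second takeover figure above is the disconnection time plus the election time. The sketch below is illustrative only; the defaults (ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3, socketConnectTimeout = 5, election time = 5 s) are back-derived from the limits quoted on this page.

```python
# Illustrative sketch of passive takeover time after a primary-L2 NIC failure:
# the hot-standby must first declare the primary dead (disconnection), then
# win an election. Assumed defaults (back-derived from this page's limits):
#   ping.idletime = 5 s, ping.interval = 1 s, ping.probes = 3,
#   socketConnectTimeout = 5, election time = 5 s

def takeover_time(idletime=5, interval=1, probes=3,
                  socket_connect_timeout=5, election=5):
    # Disconnection: one extended probe cycle including the socket-connect check
    disconnection = idletime + (interval * probes
                                + socket_connect_timeout * interval) + interval
    # Passive becomes active only after the election completes
    return disconnection + election

print(takeover_time())  # 19 seconds with the assumed defaults
```

Full cluster recovery (L1 reconnection and transaction resend) then takes additional time on top of the 19-second takeover, which is why the page estimates complete recovery within about 25 seconds.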
Zero TPS at admin console.
After 15 seconds, the hot-standby L2 starts an election to become the primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
All L1s disconnect from the primary L2 after 15 secs and connect to the old hot-standby L2 when it becomes primary.
After 20 secs, the hot-standby becomes primary and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
Once the hot-standby L2 becomes primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied' and TPS resumes at the admin console.
L2 Passive Log = WARN tc.operator.event - NODE : Server1 Subsystem: CLUSTER_TOPOLOGY Message: Node Server2 left the cluster
When = Immediately when PID exits
L1 Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Client - Connection to [localhost:8510] DISCONNECTED. Health Monitoring for this node is now disabled.
When = Immediately when PID exits
Limit = Detection immediate; the L2 passive takes over as active after Election Time (default = 5 seconds)
With reconnect properties enabled:
L2 Passive Log = (same)
When = after [l2.nha.tcgroupcomm.reconnect.timeout] from the loss of PID
The complete recovery time will be more than 20 secs (L2-L2 reconnect + election time); the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 25 seconds.
No action needed immediately (given failover). Restart L2 (it will now become the hot-standby).
Clients fail over to the hot-standby L2, which then becomes the primary. Once the old primary L2 comes back (i.e. is restarted after the machine reboot sequence), it will join the cluster as the hot-standby.
Same as F9
No action needed immediately. Restart L2 after reboot (it will now become the hot-standby)
Zero TPS at admin console.
After 14 seconds, the hot-standby starts an election to become primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
All L1s disconnect from the primary L2 after 14 secs and connect to the old hot-standby L2 when it becomes primary.
After 19 secs, the hot-standby becomes primary and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
Once the hot-standby L2 becomes primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied' and TPS resumes at the admin console.
The L1 will detect the failure after (L1-L2 health monitoring (14 secs)).
The hot-standby L2 will detect the failure after (L2-L2 health monitoring (14 secs)).
The complete recovery time will be more than 19 secs (L2-L2 health monitoring (14 secs) + election time (5 secs)); the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 25 seconds.
Same as F10 (a)
Same as F9
15 secs.
No action needed immediately (given failover to the hot-standby L2). Clean up the disk and restart services (this L2 will now be the hot-standby).
Slowdown in TPS at the admin console because the L2 will take more time to process transactions. TPS recovers when CPU returns to normal. Run tests with different intervals of high CPU usage (15s, 30s, 60s, 120s, 300s).
As soon as CPU usage returns to normal.
Root-cause analysis and fix (Thread dumps needed if escalated to TC)
Zero TPS at admin console, as the primary L2 cannot process any transactions.
Case 1: GC cycle < 45 secs
- The L1 and hot-standby L2 logs will display 'WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - L2 might be in Long GC. GC count since last ping reply :' if the L2 is in GC for more than 9 secs.
- TPS returns to normal at the admin console as soon as the primary L2 recovers from GC.
Case 2: GC cycle > 45 secs
- After 45 secs, the hot-standby L2 declares the primary L2 dead.
- The hot-standby L2 prints '[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - L2 IP: PORT is DEAD' in its logs.
- After 45 secs, the hot-standby starts an election to become primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
- After 50 secs, the hot-standby becomes primary and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
L1 Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD
When = Detection in ping.idleTime + l1.healthcheck.l2.ping.probes * ping.interval + ping.interval
Dead in ping.idleTime + l1.healthcheck.l2.socketConnectCount * (l1.healthcheck.l2.ping.probes * ping.interval + ping.interval)
Limit = Detect in 5 - 9 seconds, Dead in 57 seconds
L2 Passive Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD
When = Detection in ping.idleTime + l2.healthcheck.l2.ping.probes * ping.interval + ping.interval
Dead in ping.idleTime + l2.healthcheck.l2.socketConnectCount * (l2.healthcheck.l2.ping.probes * ping.interval + ping.interval)
Limit = Detect in 5 - 9 seconds, Dead in 45 seconds
The L2 passive takes over as active after Dead Time + Election Time.
Max allowed GC time = min(L1-L2 health monitoring (57 secs), L2-L2 health monitoring (45 secs)) = 45 secs.
The max complete recovery time will be more than 57 secs; the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 65 seconds.
Root cause analysis to avoid this situation (e.g. more Heap, GC Tuning, etc. based on what the root-cause analysis dictates).
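The 57/45-second split above comes from the L1 side and the passive-L2 side running the same dead-time formula with different socketConnectCount values. The sketch below is illustrative only; the socketConnectCount values (13 for l1.healthcheck.l2, 10 for l2.healthcheck.l2) and the other defaults are back-derived from the limits quoted on this page, not confirmed tc.properties values.

```python
# Illustrative sketch of why the max allowed GC time for a primary L2 is
# 45 seconds: both monitors use the same formula, and the tighter bound wins.
# Assumed defaults (back-derived from the limits above):
#   ping.idleTime = 5 s, ping.interval = 1 s, ping.probes = 3,
#   l1.healthcheck.l2.socketConnectCount = 13,
#   l2.healthcheck.l2.socketConnectCount = 10

def dead_time(idletime=5, interval=1, probes=3, socket_connect_count=10):
    """ping.idleTime + socketConnectCount * (ping.probes * ping.interval + ping.interval)"""
    return idletime + socket_connect_count * (probes * interval + interval)

l1_l2 = dead_time(socket_connect_count=13)  # L1-L2 health monitoring: 57 s
l2_l2 = dead_time(socket_connect_count=10)  # L2-L2 health monitoring: 45 s
max_allowed_gc = min(l1_l2, l2_l2)

print(l1_l2, l2_l2, max_allowed_gc)  # 57 45 45
```

Because the hot-standby's 45-second bound trips first, the passive L2 drives the failover even though the L1s would have tolerated the GC for another 12 seconds.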
Slow/zero TPS at the admin console, as the primary L2 cannot commit the transactions to the hot-standby L2.
The primary L2 prints 'Connection to [Passive L2 IP:PORT] DISCONNECTED. Health Monitoring for this node is now disabled.' in its logs as soon as the hot-standby L2 fails.
After 15 secs, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID[Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.
L2 Active Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD
When = Detection in ping.idleTime + l2.healthcheck.l2.ping.probes * ping.interval + ping.interval
Dead in ping.idleTime + l2.healthcheck.l2.socketConnectCount * (l2.healthcheck.l2.ping.probes * ping.interval + ping.interval)
Limit = Detect in 5 - 9 seconds, Dead in 45 seconds
Recovery Time = [L2-L2 Reconnect] = 15 secs
Restart the hot-standby L2 (blow away the dirty BDB database before restart) in case of PID failure / host failure.
Slow/zero TPS at the admin console, as the primary L2 cannot commit the transactions to the hot-standby L2.
After 14 secs, the primary L2 prints 'Connection to [Passive L2 IP:PORT] DISCONNECTED. Health Monitoring for this node is now disabled.' in its logs.
After 14 secs, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID [Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.
Recovery Time = [L2-L2 health monitoring] = 14 secs
Restart the hot-standby L2 (blow away the dirty BDB database before restart - not needed in 2.7.x or above) in case of PID failure / host failure.
Slow/zero TPS at the admin console, as the primary L2 cannot commit the transactions to the hot-standby L2.
Case 1: Hot-standby L2 host fails over to the secondary NIC within 14 secs
- No impact on cluster topology. TPS at the admin console resumes as soon as the NIC is restored at the hot-standby L2.
Case 2: Hot-standby host does not fail over to the standby NIC within 14 secs
- After 14 secs, the primary L2 prints 'Connection to [indev2.terracotta.lan:46133] DISCONNECTED. Health Monitoring for this node is now disabled.'
- After 14 secs, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID[Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.
L2 Active Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - Socket Connect to indev1.terracotta.lan:8530(callbackport:8530) taking long time. probably not reachable. INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - indev1.terracotta.lan:8530 is DEAD
When = Detection in ping.idletime + ping.probes * ping.interval + ping.interval
Dead in ping.idletime + ping.probes * ping.interval + l2.healthcheck.l2.socketConnectTimeout * ping.interval
Limit = 9 - 14 seconds (with default values)
Recovery Time = [L2-L2 health monitoring] = 14 secs
If quarantined from cluster, Restart hot-standby L2 (blow away dirty BDB database before restart - not needed for 2.7.x) in case of PID failure /Host Failure
Slow TPS at the admin console, as the primary L2 takes more time to commit transactions at the hot-standby L2. TPS recovers when CPU returns to normal. Run tests with different intervals of high CPU usage (15s, 30s, 60s, 120s, 300s).
Recovers as soon as CPU becomes normal
Analyze root-cause and resolve high-CPU issue at Hot-standby L2.
Slow/zero TPS at the admin console, as the primary L2 can commit transactions locally but cannot commit them at the hot-standby L2.
Case 1: GC cycle < 45 secs
- The primary L2 log will display 'WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - L2 might be in Long GC. GC count since last ping reply :' if the L2 is in GC for more than 9 secs.
- TPS returns to normal at the admin console as soon as the hot-standby L2 recovers from GC.
Case 2: GC cycle > 45 secs
- After 45 secs, the primary L2 health monitoring declares the hot-standby L2 dead.
- The primary L2 prints '[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - [Passive L2 IP: PORT] is DEAD' in its logs.
- After 45 seconds, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID[Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.
L2 Active Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD
When = Detection in ping.idleTime + l2.healthcheck.l2.ping.probes * ping.interval + ping.interval
Dead in ping.idleTime + l2.healthcheck.l2.socketConnectCount * (l2.healthcheck.l2.ping.probes * ping.interval + ping.interval)
Limit = Detect in 5 - 9 seconds, Dead in 45 seconds
Max Recovery Time = [L2-L2 health monitoring] = 45 secs
Root-cause analysis and fix needed for memory on the L2 getting pegged. (The actual action to be taken varies widely in this case, depending on the symptoms and analysis.)
Same as F14.
Same as F14.
The hot-standby process dies with BDB errors. Restart the hot-standby L2.
All application threads that need a DSO lock from the TC server, or those writing "Terracotta transactions" while the transaction buffer is full, will block. Once the mirror group(s) is restored, all the L1s connected to it before the failure will reconnect and normal application activity resumes.
Depends on when the Terracotta Server Array is recycled.
Not designed for N+1 failure. Restart mirror group(s) primary and hot-standby after collecting artifacts for root-cause analysis.
Not designed for N+1 failure. Restart mirror group(s) primary and hot-standby after collecting artifacts for root-cause analysis.
Minutes
Once the data center is restored, restart the Terracotta Server Array, then restart the L1 nodes. The cluster state will be restored to the point of outage.