Operations Guide

Introduction

Living successfully with a Terracotta deployment, as with any production deployment, requires developing and rigorously adhering to an operating plan. This plan should include a runbook that describes the steps to take for all routine operations on your cluster, how to monitor the health of your cluster, and how to respond to unplanned events such as hardware failures and load spikes.
Set Up Monitoring

Terracotta provides a wealth of information about the current state of your cluster. Before deployment, you should have a plan for what to monitor and how you are going to integrate it into your network operations center. There are a few different ways to retrieve this information.

The Terracotta Developer Console

The Terracotta Developer Console provides a graphical view of all of the runtime information available from Terracotta. The Developer Console is especially useful during development and testing.
Enterprise versions of Terracotta also include the Terracotta Operations Center, a GUI operator's console offering features such as backups of shared data, client disconnect, and server shutdown controls. To learn more about the many benefits of an enterprise version of Terracotta, see our Enterprise Products.
The Terracotta JMX Interface

The Terracotta JMX Interface can be used to retrieve all of the data presented by the Terracotta Developer Console and Operations Center. Additionally, because it uses the standard JMX protocol, it can be integrated into your existing monitoring and management infrastructure. Likewise, it can be used to create alerts when values cross certain tolerance thresholds.

The Terracotta Cluster Statistics Recorder and Snapshot Visualization Tool

In addition to the runtime statistics offered via JMX and the Developer Console, Terracotta offers highly detailed statistics that may be recorded, stored in a database, and viewed using the Terracotta Snapshot Visualization Tool (SVT). Because gathering these more detailed statistics can add significant load to the Terracotta server array, they are not available as always-on runtime statistics. You should, therefore, be judicious in gathering them. For more information on using the Terracotta Cluster Statistics Recorder and the Snapshot Visualization Tool, see the Platform Statistics Recorder Guide and the Developer Console.

Baselines And Tolerances: Understanding How Your Application Behaves

During your performance and destructive testing phase, you should gather baselines for the relevant statistics that you are going to monitor. You should also establish tolerance bands for them so that you know when the health of your cluster is degrading. Once in production, you should maintain a regular schedule of baselining your application, re-evaluating your tolerances, and adjusting your runbook accordingly. You must also do this for every major software release. Baselining and tolerance setting is not something you can do once and forget about. It is an ongoing process of adapting to changing workload characteristics and changing software.
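As an illustration of what a tolerance band can look like in practice, the following minimal sketch derives a band from baseline samples as the mean plus or minus k standard deviations and then checks whether a new reading falls inside it. The class name, the sample values, and the choice of a standard-deviation band are assumptions made for this example; they are not part of Terracotta, and your own tolerances may be better expressed in other ways (fixed limits, percentiles, and so on).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a tolerance band derived from baseline measurements.
final class ToleranceBand {
    private final double lower;
    private final double upper;

    ToleranceBand(double lower, double upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // Builds a band of mean +/- k population standard deviations.
    static ToleranceBand fromBaseline(List<Double> samples, double k) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("at least one baseline sample is required");
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        double mean = sum / samples.size();

        double squaredDeviations = 0.0;
        for (double s : samples) {
            squaredDeviations += (s - mean) * (s - mean);
        }
        double stdDev = Math.sqrt(squaredDeviations / samples.size());

        return new ToleranceBand(mean - k * stdDev, mean + k * stdDev);
    }

    // True when the reading lies within the band (inclusive).
    boolean contains(double value) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        // Made-up baseline readings, for example transactions per second.
        List<Double> baseline = Arrays.asList(100.0, 110.0, 90.0, 100.0);
        ToleranceBand band = fromBaseline(baseline, 2.0);
        System.out.println("105 within tolerance: " + band.contains(105.0));
        System.out.println("130 within tolerance: " + band.contains(130.0));
    }
}
```

Because baselines drift with workload and software changes, a band like this would be recomputed on the same schedule as the rest of your baselining.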
The better you understand how your application behaves under a wide range of conditions, the better equipped you will be to handle critical situations as they arise.

Create A Dashboard View Of Your Runtime Statistics

You should create a dashboard view of the most important statistics that indicate the health of your cluster. This will let you see at a glance when your application starts to operate outside its tolerances.

Monitoring for Issues

The following section describes various issues that may arise during the ongoing operation of your cluster, how to monitor and alert for them, and what corrective actions to take. You can use this section as the starting point for your cluster operations runbook.

Issue: Disk Full

Description

If a file system that a Terracotta server instance writes to is full, that server instance will throw an exception. If the file system in question is used only for logging, logging will stop, but the cluster will continue to operate. If the file system in question contains the Terracotta server instance's data store (the Sleepycat database), the server instance will fail. In this case, a Terracotta cluster with more than one server instance will continue operation.

Symptoms/Manifestation

The system logs (for example, /var/log/messages or /var/adm/messages) will contain a "disk full" message. If the file system containing a Terracotta server instance's data store is full, the server log will contain an exception similar to the following:

    TerracottaServerLog: [WorkerThread(commit_changes_stage,2)] ERROR com.tc.server.TCServerMain - Thread:Thread[WorkerThread(commit_changes_stage,2),5,TC Thread Group] got an uncaught exception. calling CallbackOnExitHandlers.
    Environment invalid because of previous exception: com.sleepycat.je.RunRecoveryException: (JE 3.2.76) . IOE during Write.
    Caused by: java.io.IOException: No space left on device

Monitoring

Monitor disk usage using your standard system monitoring tools.

Suggested Tolerances
Actions
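If your standard monitoring tools do not already cover the file systems a server instance writes to, a small check along the following lines can feed an alert. This is a sketch only: it uses the standard java.io.File space queries, and the 80% threshold and the default directory are placeholder assumptions rather than recommended values.

```java
import java.io.File;

// Hypothetical helper: reports how full the file system behind a path is.
final class DiskUsageCheck {

    // Fraction of the file system in use, from total and usable byte counts.
    static double usedFraction(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        return 1.0 - ((double) usableBytes / totalBytes);
    }

    // True when usage of the file system containing 'path' exceeds 'threshold'.
    static boolean exceeds(File path, double threshold) {
        long total = path.getTotalSpace();
        if (total == 0) {
            // getTotalSpace() returns 0 when the path does not name a partition.
            throw new IllegalStateException("cannot determine size for " + path);
        }
        return usedFraction(total, path.getUsableSpace()) > threshold;
    }

    public static void main(String[] args) {
        // Pass the directory that holds the server data or logs; "." is a placeholder.
        File dir = new File(args.length > 0 ? args[0] : ".");
        double threshold = 0.80; // assumed alert threshold, adjust to your tolerances
        if (exceeds(dir, threshold)) {
            System.out.println("ALERT: file system for " + dir.getAbsolutePath() + " is above threshold");
        } else {
            System.out.println("OK: file system for " + dir.getAbsolutePath() + " is within threshold");
        }
    }
}
```

Checking the data-store and log directories separately is worthwhile, since the consequences of each filling up differ as described above.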
Issue: Disk I/O Throughput Degradation

Symptoms/Manifestation

High disk access latency.

Potential Causes

A Terracotta server instance keeps as much clustered object data as possible in an in-memory cache. However, if a client requests a shared object that is not present on the heap, the server instance reads that object's data from disk. A high cache miss rate in a server instance therefore has a negative impact on performance. Another potential cause is excessive clustered object garbage.

Monitoring
Actions

Increase the heap and the cache size in Terracotta server instances with disk I/O throughput degradation. If excessive garbage is present, try reducing the number of short-lived clustered objects being created.

Issue: Terracotta Server Instance Low On Memory

Description

A Terracotta server instance may run low on memory for a number of reasons. See the Tuning Guide for details.

Symptoms/Manifestation

Low memory in a Terracotta server instance is indicated by:
Monitoring

There are a number of ways to monitor a Terracotta server instance's heap usage:
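As a generic illustration, and not a Terracotta-specific API, the heap usage of any JVM can be read over JMX through the standard platform Memory MXBean. The sketch below computes the used fraction of the heap from an MBeanServerConnection. For simplicity it connects to the local JVM; in a real deployment you would obtain the connection to the server JVM through javax.management.remote.JMXConnectorFactory with a service URL taken from your own configuration. The class name and the 85% threshold are assumptions for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;

// Hypothetical helper: reads JVM heap usage through the standard Memory MXBean.
final class HeapUsageProbe {

    // Returns used heap divided by the heap limit for the JVM behind 'conn'.
    static double heapUsedFraction(MBeanServerConnection conn) {
        try {
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            // getMax() is -1 when no maximum is defined; fall back to committed memory.
            long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
            return (double) heap.getUsed() / limit;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Local JVM used here so the sketch runs anywhere; swap in a remote connection in practice.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        double fraction = heapUsedFraction(conn);
        double threshold = 0.85; // assumed alert threshold, adjust to your tolerances
        System.out.printf("Heap used: %.1f%%%n", fraction * 100.0);
        if (fraction > threshold) {
            System.out.println("ALERT: heap usage above threshold");
        }
    }
}
```

A single reading can be misleading right before a garbage collection, so alerts are usually better based on usage that stays high across several samples.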
Actions

You may need to increase the heap size of Terracotta server JVMs that run low on memory. Additionally, you may need to tune the Terracotta virtual memory manager as described in the Tuning Guide.

Issue: Terracotta Server CPU Utilization Too High

Symptoms/Manifestation

If a Terracotta server instance's CPU utilization is high, you may see a dip in the transaction rate reported by that server instance; you may also see a dip in the transaction rate of the cluster as a whole.

Monitoring

Monitor Terracotta server CPU utilization using your standard system monitoring tools or through the Terracotta JMX interface. Make sure you monitor the utilization of all processors/cores.

Suggested Tolerances
Actions

If not all processors/cores are being utilized equally, you may have a hardware or operating system issue, in which case you should consult your hardware or software manufacturer. If all processors/cores are being utilized equally, you may need to increase the CPU capacity of the Terracotta server machine.

Issue: Terracotta Server CPU Failure

Symptoms/Manifestation

The kernel should panic and the machine will probably go into a reboot sequence. You may also see messages in the operating system logs (for example, /var/log or /var/adm).

Monitoring

Monitor the health of the Terracotta server machine's CPU using your standard system monitoring tools.

Actions

In the event of a kernel panic, a Terracotta server instance will fail over to a standby Terracotta server instance called a "hot standby," or simply have its tasks continued by another active server instance in the Terracotta server array. If using a hot standby, the cluster will automatically reconnect to the new active Terracotta server instance and resume normal operations. You can restore your cluster to a high availability configuration by deploying a new hot standby.

Issue: Terracotta Client CPU Failure

Symptoms/Manifestation

As with a failure of a Terracotta server CPU, the kernel should panic and the machine will probably go into a reboot sequence. When the client is disconnected from the Terracotta server array, it will be automatically removed from the cluster and any cluster resources it held will be reclaimed.

Monitoring

Monitor the health of a Terracotta client machine's CPU using standard system monitoring tools. If you are using a load balancer, it should also be configured to detect a cluster node failure.

Actions

If you are using a load balancer, the load balancer should automatically detect the failure and rebalance load across the other cluster nodes. To regain capacity, you should deploy a replacement machine.
Issue: Terracotta Server Array Distributed Garbage Collection Performance Degradation

Description

Under certain conditions, the Terracotta Distributed Garbage Collector's performance may degrade. For more information on this issue, see the Tuning Guide.

Symptoms/Manifestation

If the Terracotta Distributed Garbage Collector (DGC) falls behind, you will see the following:
Monitoring

Use the Terracotta Developer Console or the Terracotta JMX interface to monitor the following:
Suggested Tolerances

The DGC tolerances will vary depending on your application, but as a rule of thumb, DGC cycle times over 30 seconds and pause times over 10 seconds are a cause for concern.

Actions

See the Tuning Guide for information on tuning the DGC.

Issue: Terracotta Server Instance Object Cache Hit Rate Degradation

Description

A Terracotta server instance keeps object data in a memory cache for fast access. If the active set of objects doesn't fit in the server instance's memory cache, you may experience performance degradation.

Symptoms/Manifestation

A poor cache hit rate in the Terracotta server instance is indicated by:
Monitoring
Actions

Increase the heap and the cache size in Terracotta server instances with disk I/O throughput degradation. If excessive garbage is present, try reducing the number of short-lived clustered objects being created.

Issue: Application Deadlock

Description

An application concurrency bug may lead to a cluster-wide deadlock. Sometimes, however, what appears to be a deadlock may actually be a very slow operation.

Monitoring

You should have a health check specific to your application that alerts your operations center when throughput degrades outside certain tolerances. In addition, you may use the transaction rate from the Terracotta JMX interface or the Terracotta Developer Console as a proxy for cluster throughput.

Actions
Cluster Events

Description

The Terracotta JMX interface includes event notifications for cluster events. A node coming online or going offline will fire an event that you can use to monitor the disposition of the cluster. See the JMX Guide for details.

Upgrading To A New Version Of Terracotta or Java

The safest way to upgrade to a new version of Terracotta or the JDK is to take the cluster down and upgrade all Terracotta server instances and clients (application servers) before restarting the cluster.

Note: This assumes that the Terracotta data format stored in the underlying store is compatible between the two releases. In the unlikely case that it is not, you must remove the data files before restarting the upgraded Terracotta server instance, or point the upgraded server instance to an empty directory location in the Terracotta configuration file (by default, tc-config.xml).

Always make sure that you test this procedure in a staging environment before performing it in production.
See the Terracotta product matrix for Terracotta products that support rolling (live) upgrades.

Deploying A New Version of Your Application Software

The safest way to upgrade to a new version of your application is to take the cluster down and upgrade all Terracotta clients (application servers). Restart the clients after they are upgraded.
Upgrading your cluster to a new version of your application software can be done in a phased approach. One part of the cluster is taken out of service for upgrading, then returned to service while the next part is taken out of service and upgraded, then returned to service, and so on.
The steps are as follows:
Class Schema Change Considerations

Terracotta does not use Java serialization to share object data, so upgrading your application code is not subject to the same class-versioning limitations that serialization imposes. However, there are some considerations that you must take into account when upgrading your Terracotta-enabled code.
Custom Object Data Translation

While there is no direct support for modifying the class schema in the Terracotta object database, it is possible to create a custom object data translator that converts an object graph from one format to another. One possible approach is to write code that walks an existing object graph, copying values from it to a new object graph that uses the new class format. Once the translation has occurred, code using the new class format would then use the new object graph. The old object graph will fall out of scope and be garbage collected.

Always make sure that you test all upgrade procedures in a staging environment before performing them in production.
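The graph-walking approach to custom object data translation described above might look roughly like the following sketch. It copies a graph of an old class format into a new one and uses an identity map so that shared references and cycles in the old graph are preserved in the new one. The OldCustomer and NewCustomer classes and their fields are hypothetical, and the sketch works on plain Java objects only; how the old and new graphs are attached to shared roots in your Terracotta configuration is not shown here.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical translator from an old class format to a new one.
final class GraphTranslator {

    // Old format: a single name field (illustrative only).
    static final class OldCustomer {
        String fullName;
        OldCustomer referrer;

        OldCustomer(String fullName) {
            this.fullName = fullName;
        }
    }

    // New format: the name split into two fields (illustrative only).
    static final class NewCustomer {
        String firstName;
        String lastName;
        NewCustomer referrer;
    }

    // Maps each old object (by identity) to its already-created copy.
    private final Map<OldCustomer, NewCustomer> translated = new IdentityHashMap<>();

    NewCustomer translate(OldCustomer old) {
        if (old == null) {
            return null;
        }
        NewCustomer existing = translated.get(old);
        if (existing != null) {
            return existing; // already copied: keeps shared references shared
        }
        NewCustomer copy = new NewCustomer();
        translated.put(old, copy); // register before recursing so cycles terminate

        int space = old.fullName.indexOf(' ');
        copy.firstName = space < 0 ? old.fullName : old.fullName.substring(0, space);
        copy.lastName = space < 0 ? "" : old.fullName.substring(space + 1);
        copy.referrer = translate(old.referrer);
        return copy;
    }

    public static void main(String[] args) {
        OldCustomer ada = new OldCustomer("Ada Lovelace");
        OldCustomer alan = new OldCustomer("Alan Turing");
        ada.referrer = alan;
        alan.referrer = ada; // deliberate cycle

        NewCustomer newAda = new GraphTranslator().translate(ada);
        System.out.println(newAda.firstName + " was referred by " + newAda.referrer.firstName);
    }
}
```

The recursion here is fine for small graphs; a very deep graph would call for an explicit work queue instead, to avoid exhausting the stack.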
Patching an Installed Version of Terracotta

Your Terracotta installation may need a fix between point releases. Terracotta software patches provide a low-impact software upgrade.

Back Up the Cluster Database

The Terracotta cluster has a BerkeleyDB database containing all of the object data for the Terracotta virtual heap as well as some metadata about the state of the cluster.
A one-step database backup button is available. The backup captures a full snapshot of the database. See the Terracotta Operations Center for more information on backing up the database.