Object Level Fault Tolerance for CORBA-based Distributed Computing

By Tom McDonough, Director of Professional Services, Semaphore.

Service extensions to a CORBA ORB (i.e., Replica Manager and Replica Service Agents) and ORB specific functionality (i.e., IIOP and Interceptors can help achieve Object level fault tolerance).

Many development activities overdesign and develop fault tolerant functionality that could have been addressed through CORBA specific features, needlessly expending scarce project resources. Some projects incorrectly use CORBA features to implement fault tolerant requirements and find that the distributed applications didn’t function properly in the presence of many fault conditions.

You want your system to maintain a set pattern of behavior at its interface with its environment and users, according to your requirements. They don’t really care how the system does it. But you must verify that it meets this goal.

Objects that represent underlying business semantics are quickly replacing disparate processes and global data constructs. Object-to-object method invocations are supplanting traditional process-oriented remote procedure calls and object-based data encapsulation is superceding local data stores. CORBA-based distributed computing architectures are becoming more prevalent in commercial and information technology development initiatives. Object-oriented middleware can be a standard interconnecting mechanism for distributed business objects.

Failures and Faults

A failure is an external symptom of a fault. A system fails if the service it delivers to the user deviates from compliance with the system specifications. A fault is a defect with the potential to cause a failure. While it has this potential, it may never actually do so. A failure is a guarantee that a fault has occurred.

Faults are either transient or permanent. Transient faults have limited life span and may be a temporary system failure cause. They can be very difficult to detect and correct. Permanent faults can cause a non-recoverable failure in a component. These faults are the easiest type to detect and diagnose.

In real, complex systems faults can propagate; that is, interplay with other unrelated faults or failures. Dormant faults can be triggered into activity, then propagate others in a trajectory through a system.

Latent faults, undetectable because they’ve never caused an observable failure, can lie in ambush until they’re triggered.

Setting a Goal

The fault tolerance goal is to prevent system failure even in the presence of a fault in one or more of its subsystems or components.

System components can be made fault tolerant, but an entire system can’t be fault tolerant against its own failures. That requires one or more available secondary subsystems to subsume the failed system’s functionality.

Replication Schemes

Redundancy is vital, to support fault tolerance. This requires replication. There are three service redundancy replication schemes:

Hot Backup – extra copies of a service are active, processing client requests and synchronizing internal states with every other peer service. If a primary service fails any hot backup service can become the primary service provider.
Warm Backup –backup copies of a service run while the primary service is running, but don’t receive or respond to client requests. The primary service periodically synchronizes its state with the warm backup services. If the primary service fails, a warm backup service is elected as the new primary service. It resumes operations from the point of the latest synchronization.
Cold Backup – no backups of the service are active when the primary service is running. The primary service may periodically checkpoint its state on stable storage, but doesn’t synchronize with any of the cold backups. Upon primary service failure, a cold backup is started as the new primary.

Replication Schemes

System Topology

In order to support a CORBA-based distributed computing fault tolerant system, you must first understand its physical and logical topologies.

Physical Topology

Physical topology describes how nodes (i.e., processors) interconnect through physical communications channels. In distributed systems, nodes are loosely coupled, sharing neither common memory nor a clock. If processes are to communicate with each other a network is required.

In CORBA-based distributed computing environments the Bus is most popular. Processors connect to a common network with communication achieved via point-to-point protocols. The bus is composed of Object Request Brokers (ORBs) interconnected using a TCP/IP-based protocol called the Internet Inter-ORB Protocol (IIOP).

Logical Topology

Logical topology describes how processes (i.e., logical nodes) are organized to cooperate with each other. Processes sharing the execution time of a single processor are called competing concurrent processes. They don’t achieve true parallelism since each process competes for the time on a single processor. Non-competing process concurrency can be achieved in distributed systems by executing processes in parallel on physically distinct processors. The figure below depicts a logical process model overlaid onto a physical processor structure. A system consists of a set of processes and logical channels between them. In this case, process ‘a’ on processor ‘A’ has a logical communications path to process ‘d’ on processor ‘D’ which transverses processor ‘C’.

Logical Model

Each processor contains a platform specific ORB that communicates through a physical network supporting TCP/IP. IIOP is the basis for all inter-ORB communications. Messages between processes communicate directly with each using the message routing services supported by each ORB interconnected via IIOP.

Handling Faults

Once you know a system’s topologies, you can implement schemes to handle faults. This requires service replication, fault detection, fault recovery (in a service object, in a host or network connection, or in a Replica Service Agent) and fault notification.

Service Replication

Redundancy, the key to achieving fault tolerance in distributed systems, is often achieved in a CORBA-based distributed systems using replica creation and management services distributed through an ORB bus-based topology. The following figure depicts a typical topology based on three key replication entities: the Replica Service Agent (RSA), the Replication Manager (RM), and the Factory.

Prototypical Replica Service Management

The Factory provides a consistent construction mechanism used in the instantiation of the RM and the server object. Factories are objects containing data and behaviors required to generate the RM, server object, and any other related objects and associations.

The RM detects and communicates faults in application server objects registered with it using heartbeats or polling. It notifies the RSA of the fault and awaits instructions about termination and subsequent recovery actions. In a distributed system, one RM active on each host monitors resident server objects. The RM also terminates faulty server objects and recovers them during fault recovery.

The RSA creates and maintains the state associated with the replication scheme (i.e., hot, warm, or standby) manages replica service querying, and handles the recover process associated with detected faults. Typically there’s only one active RSA for the entire distributed system (i.e., centralized management strategy). In most cases, all RMs and server objects in the distributed system are registered with the RSA, which consolidates information into a centralized repository that it maintains. The RSA state table information is periodically saved to stable storage (i.e., checkpoint) and the RSA is itself instantiated as a fault tolerant entity.

In the example above, a client object (1) requires services from a server object. The RSA (2) identifies the appropriate server object, based on the replication state information, and returns a server object reference to the client object. If the server object doesn’t exist, it creates Factories (3) that instantiate the desired server objects (4) and the associated RM (5), in accordance to the replication scheme.

Fault Detection

Fault detection is the timely ability to identify a fault’s existence and location. Principally in a distributed system, server process faults, network faults, and processor (host) faults must be detected.

Fault Types

Server process (i.e., server object) faults are most often detected using heartbeats or polling. Heartbeat means the server object periodically sends a message to the RM indicating it’s “alive” and functioning normally. Concurrently, the RM periodically checks to see if the server object has sent its heartbeat. If it hasn’t, the RM categorizes the object as faulty and sends a message indicating the fault to the RSA.

In polling, the RM periodically asks if the replica is “alive” and functioning normally. If no response is received, the RM categorizes the service object as faulty and sends a message to the RSA.

Process and network faults can also be detected, but not differentiated, this way by establishing a monitoring relationship between each Replica Manger and the RSA.

Detecting a fault in the RSA process, its associated host processor, or one of its many communications networks, also requires a limited amount of redundancy in the RSA. Typically, this is accomplished using chained RSA replications, as shown below. Each chained RSA (one primary located at host B and two replicas located on host A and C) periodically sends a heartbeat to one of its neighbors and periodically polls itself to determine if the neighboring RSA has sent its heartbeat. If not, a fault in the RSA process, host processor, or network is assumed, and the detecting RSA initiates recovery.

Replica Service Agent Fault Detection

Fault Recovery

Restoring a single faulty object to consistent state (object rollback) can be as simple as instantiating a new object with a state consistent with the last checkpoint. If done frequently, impact on operations due to information latency is minimal.

Recovering a system composed of multiple objects that are collaborating is more difficult. It requires that different objects coordinate checkpointing to achieve global consistency. If they don’t, cascading rollback, commonly called the Domino Effect occurs, and the system only stabilizes when it regresses to the Rollback Line.

A number of approaches can be used to avoid the domino effect. The most common establishes a two-phase checkpointing scheme where coordinated objects use minimal communications, so that any rollback on one object only reverts to the coordinated checkpoint state.

Service Object Fault Recovery: When a registered server object fails, the RSA must identify, elect, and communicate the existence of a new primary server object. It will use the replication information in its state table to elect a new primary server object.

Host Network Fault Recovery: Recovering from loss of a service host or network connection may not only require identifying and communicating a new server object, but potentially establishing a new set of replication services. This isn’t dissimilar from recovering from a lost server object.

Replica Service Agent Fault Recovery: Since the Replica Service Managers form a logical ring composed of a single primary RSAs and multiple backups recovery swill fall into one of two scenarios: failure of the primary or one of the multiple backup RSAs.

Fault Notification.

Notification encompasses both communicating a fault’s presence and locating replicated services. Fault Transparent redundancy implies that the client object need not be aware that it is receiving services from a replicated server object. Fault transparency in a CORBA-based distributed computing environment is achieved using Interceptors. These are optional extensions to the object request broker to allow implementing additional services. CORBA permits two types of interceptors, request-level and message-level. The former is the most frequent used in structured message notifications.

Summary

The Common Object Request Broker Architecture (CORBA) is emerging as a de facto standard for interconnecting distributed business objects in heterogeneous systems. CORBA specifies mechanisms for achieving object-based interoperability but not how to achieve highly reliable fault tolerant objects.

The Object Management Group (OMG), a consortium of over 800 companies responsible for the development of the CORBA specification, has begun to investigate architectures and services to augment the standard for this purpose. Even so, there is scant readily available information on how fault tolerance is being achieved in today’s CORBA-based systems. This research report begins to address this deficiency.

Tom McDonough can be reached at tmcdonough@sema4usa.com. Semaphore's web site is http://www.sema4usa.com.