Guardian is an operating system designed to work with the underlying hardware to provide fault tolerance with minimal overhead. The author first describes the main idea behind building fault-tolerant systems: avoid any single point of failure. Failures should be contained so that they do not bring down the entire system. This is achieved by using redundant components that keep the system running when one component fails. The system has at least 2 CPUs and 2 buses, as well as duplicated I/O and storage components. The downside of this approach, as the author acknowledges, is cost: duplicating hardware components almost doubles the cost of the system.
The operating system was designed to work with the duplicated hardware components. The OS manages the system and transfers the load when it detects a component failure. For example, if a CPU fails, it is stopped and its load is transferred to the other CPU without system downtime. Depending on the type of failure, either the OS recovers from it or the failing component is replaced. The system offered some cool features like the “nonstop” function, which is the ability to remove and replace components without turning off the system. This is a very desirable and important feature, especially in server environments where reliability and availability are critical. So the Tandem platform offered features similar to those of some of the most powerful and expensive servers in the world, like IBM mainframes. Taking that into account, the high cost of the Tandem system shouldn’t have been a major issue as long as it offered features that conventional systems could not.
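To make the failover idea concrete, here is a toy sketch in Python. It is not how Guardian is implemented; the `Cpu` class, the `fail_over` function, and the task names are all made up for illustration. It only shows the behaviour described above: the failed CPU is stopped and its load moves to a surviving CPU.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Hypothetical model of one processor and the tasks it is running."""
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)


def fail_over(cpus, failed_name):
    """Stop the failed CPU and spread its tasks over the healthy ones."""
    failed = next(c for c in cpus if c.name == failed_name)
    failed.healthy = False  # the failing CPU is stopped

    survivors = [c for c in cpus if c.healthy]
    if not survivors:
        # With no redundancy left, the failure can no longer be contained.
        raise RuntimeError("no healthy CPU left")

    # Take the orphaned tasks and hand them out round-robin.
    orphaned, failed.tasks = failed.tasks, []
    for i, task in enumerate(orphaned):
        survivors[i % len(survivors)].tasks.append(task)
    return survivors


cpus = [Cpu("cpu0", tasks=["db", "tx-log"]), Cpu("cpu1", tasks=["terminal-io"])]
fail_over(cpus, "cpu0")
print(cpus[1].tasks)
```

With only two CPUs, a second failure raises an error in this sketch, which is the single-point-of-failure situation the duplicated hardware is meant to avoid.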
One aspect of their solution that I don’t like is checkpointing. It uses CPU cycles, and the author mentions that it is left to the programmer to decide when and how much data to checkpoint. I am not sure how feasible that is in practice, because it requires the programmer to have in-depth knowledge of the system in order to take appropriate checkpoints.
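The trade-off the programmer faces can be sketched with a small Python example. This is only an illustration under my own assumptions, not Guardian's actual API: the `Backup` class, `run_primary`, `take_over`, and the `checkpoint_every` parameter are all hypothetical. A primary process sums a list of numbers and periodically sends its state to a backup; if the primary crashes, the backup resumes from the last checkpoint it received.

```python
import copy


class Backup:
    """Hypothetical backup process: keeps the last checkpoint it was sent."""

    def __init__(self):
        self.state = None
        self.messages = 0  # number of checkpoint messages received

    def receive(self, state):
        self.state = copy.deepcopy(state)
        self.messages += 1


def run_primary(items, backup, checkpoint_every, crash_after=None):
    """Sum items, checkpointing after every `checkpoint_every` items.

    Returns the final state, or None if the simulated crash happens.
    """
    state = {"next": 0, "total": 0}
    backup.receive(state)  # initial checkpoint before any work is done
    for i, item in enumerate(items):
        if crash_after is not None and i == crash_after:
            return None  # simulated CPU failure before item i is processed
        state["total"] += item
        state["next"] = i + 1
        if state["next"] % checkpoint_every == 0:
            backup.receive(state)
    return state


def take_over(items, backup):
    """Backup resumes from the last checkpoint and redoes the lost work."""
    state = copy.deepcopy(backup.state)
    for i in range(state["next"], len(items)):
        state["total"] += items[i]
        state["next"] = i + 1
    return state


items = [1, 2, 3, 4, 5, 6, 7]
backup = Backup()
run_primary(items, backup, checkpoint_every=3, crash_after=5)
result = take_over(items, backup)
print(result["total"], backup.messages)
```

Checkpointing after every item would mean the backup never has to redo any work, but every step would cost an extra message. Checkpointing rarely is cheap but forces the backup to repeat more work after a failure. Picking that interval, and deciding what goes into the state, is exactly the decision that is left to the programmer.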