Assignment 4 for Sample Solutions - Dependability | CS 686, Assignments of Computer Science

Material Type: Assignment; Professor: Knight; Class: Dependability; Subject: Computer Science; University: University of Virginia; Term: Unknown 1989;

Typology: Assignments

Pre 2010

Uploaded on 07/29/2009

Department of Computer Science University of Virginia
CS686 - DEPENDABLE COMPUTING
ASSIGNMENT 4
SAMPLE SOLUTIONS
Individual Activity
There is far more to these questions than most of you realized. Read the following carefully.
1. For the dual redundant architecture, we noted that software has to be deterministic and
predictable. More specifically, we noted that great care needs to be exercised in the use of
threads and processes. Develop an approach to the use of threads that would ensure deterministic
behavior of a multi-threaded program.
No threads or concurrency could be considered as an option....
There are two issues to deal with:
(1) Fundamental differences in control flow within the threads.
(2) Timing differences between the two target systems that could affect control flow.
To deal with (1): Threads must operate with a non-pre-emptive scheduler so that there is no
dependence on a system clock for switching between threads. Non-pre-emptive scheduling
means that threads stop executing when they block and not because some timer expires.
Although the high-level architecture of such a system includes a “clock”, that clock is not very
well defined. Even if the clock meant the actual hardware oscillator, you could not count on
precisely the same timing between the two units because of different individual delays.
Also, there can be no separate local devices that could be non-deterministic like disks.
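As an illustration of point (1), here is a minimal sketch of non-pre-emptive scheduling, using Python generators as stand-ins for threads. The names `worker` and `run_cooperative` are hypothetical and not part of the assignment. Each "thread" gives up the processor only at an explicit `yield`, which plays the role of a blocking call, so the interleaving depends only on the program and not on a timer.

```python
from collections import deque

def worker(name, steps, log):
    """Hypothetical thread body: it gives up control only where it would block."""
    for i in range(steps):
        log.append(f"{name}:{i}")  # work done between blocking points
        yield                      # voluntary switch point (stands in for a blocking call)

def run_cooperative(threads):
    """Round-robin over generators; no timer ever forces a switch."""
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)            # run until the thread blocks
            ready.append(t)    # requeue it behind the other ready threads
        except StopIteration:
            pass               # thread finished; drop it

def run_once():
    log = []
    run_cooperative([worker("A", 3, log), worker("B", 2, log)])
    return log

print(run_once())
```

Running `run_once()` repeatedly gives the same interleaving every time, which is the property both units of the dual-redundant pair need.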
To deal with (2): Even if the same scheduling algorithm is used, the processor timing will
differ thereby offering the possibility of divergence between threads if one gets ahead of
another. This probably means no paging since that affects speed and requires a peripheral
disk. Care has to be taken with cache management to be sure cache content is identical.
Finally, various internal processor clocks could differ so there is a need to synchronize the
machines periodically. The simplest way to do that is to synchronize when output is to be
generated (as seen by the comparator), but that might be too infrequent. Many systems
synchronize on a fixed time boundary, i.e., there is a set of specific synch points in the code
that require the two units to synchronize. For example, the Space Shuttle uses four machines in
the on-board Primary Flight Computer and they synchronize every 40 milliseconds.
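The idea of fixed synch points can be sketched as follows. This is a single-process simulation under stated assumptions: the `Replica` class, its toy state update, and the use of a SHA-256 digest as the comparison value are all illustrative choices, not something specified in the course material. In a real system each unit would wait at the synch point until its partner arrived; here the two replicas are simply stepped in turn.

```python
import hashlib

class Replica:
    """Hypothetical replica: a deterministic state machine driven by its inputs."""
    def __init__(self, state=0):
        self.state = state

    def run_frame(self, inputs):
        # Toy deterministic computation performed between two synch points.
        for x in inputs:
            self.state = (self.state * 31 + x) % 1_000_003

    def digest(self):
        # Compact summary of the state, exchanged at the synch point.
        return hashlib.sha256(str(self.state).encode()).hexdigest()

def run_lockstep(frames, a, b):
    """Run both replicas frame by frame and compare digests at each synch point.

    Returns the index of the first frame at which the replicas disagree,
    or None if they stay identical throughout.
    """
    for i, inputs in enumerate(frames):
        a.run_frame(inputs)
        b.run_frame(inputs)
        if a.digest() != b.digest():
            return i
    return None

frames = [[1, 2], [3], [4, 5, 6]]
print(run_lockstep(frames, Replica(), Replica()))   # identical replicas
print(run_lockstep(frames, Replica(), Replica(5)))  # second replica starts corrupted
```

Comparing at every frame boundary rather than only at output bounds how far the two units can drift before a divergence is noticed.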
2. Consider the disk structure known as a mirrored disk. Writes go to two separate identical disks
so that there are always two copies of the data. If one of the disks fails, it is replaced with a
new unit. The new unit is then resilvered, i.e., made to contain the same data as the surviving
original unit. Develop a resilvering algorithm that could be used by the recovery software.

Throughout the process of resilvering, all writes go to both disks. All reads go to the remaining/surviving disk, with the data read also being written to the replacement disk. The real issue is the algorithm used for the remainder of the data.

A block copy, i.e., mass copy of material from the surviving disk to the new one, is not going to work because of the interference it would cause. The constraint is the time to complete. Slower means longer exposure to a vulnerable state; quicker means that the copying will interfere with normal operation, possibly seriously. Disk operations can be scheduled by the operating system so that the resilvering is always low priority behind normal disk operations. This does not help when the disk is already busy with a resilvering operation at the moment a new application request arrives. A model that helps to reduce vulnerability is to assume that likely requests will be logically close to recently referenced data. For example, if a record is read from a file, then subsequent reads are likely to come from the same file. The same holds for writes. So resilvering should make logically related/close data a priority.
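A minimal sketch of such a resilvering policy follows, using an in-memory model. Everything here is an assumption made for illustration: the `Mirror` class, the use of block-number distance as a stand-in for "logically close", and the convention that `resilver_step` is called only when no foreground request is waiting.

```python
class Mirror:
    """Hypothetical in-memory model of a mirrored pair during resilvering."""

    def __init__(self, surviving_blocks):
        self.src = list(surviving_blocks)            # surviving disk
        self.dst = [None] * len(self.src)            # replacement disk
        self.pending = set(range(len(self.src)))     # blocks not yet copied
        self.last_ref = 0                            # most recently referenced block

    def write(self, blk, data):
        # Writes always go to both disks, so this block is now in sync.
        self.src[blk] = data
        self.dst[blk] = data
        self.pending.discard(blk)
        self.last_ref = blk

    def read(self, blk):
        # Reads come from the survivor; the data read is also written to the replacement.
        data = self.src[blk]
        self.dst[blk] = data
        self.pending.discard(blk)
        self.last_ref = blk
        return data

    def resilver_step(self):
        """Copy one pending block, choosing the one closest to the last reference.

        Intended to run at low priority, one block at a time, so a foreground
        request never waits behind more than a single background copy.
        """
        if not self.pending:
            return None
        blk = min(self.pending, key=lambda b: (abs(b - self.last_ref), b))
        self.dst[blk] = self.src[blk]
        self.pending.remove(blk)
        return blk

m = Mirror(list("abcdefgh"))
m.read(5)                       # application touches block 5
print(m.resilver_step())        # background copy picks a neighbour of block 5
m.write(0, "z")                 # foreground write lands on both disks
while m.resilver_step() is not None:
    pass
print(m.dst == m.src)
```

Copying one block per step keeps each background operation short, which limits, but does not eliminate, the delay seen by a request that arrives mid-copy.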

Suppose the disks were at separate locations, how does that affect things? It makes synchronizing anything much more difficult because of the time delay. Disk copies typically are remote for larger systems.

Suppose the data is encrypted, does that change things? It depends where the encryption is done. If it is done in the disk controller, then there is no penalty. If it is done by software on the host, then this places a large compute burden on what was a very I/O intensive activity.

Would you mirror a RAID set? Certainly, in order to obtain geographic diversity. How does all this change RAID, i.e. a duplicated RAID system? It does not, because RAID is for local clusters for the most part. So mirroring of a RAID set is common and undertaken to enable geographic diversity.

3. Suppose a switched, dual-redundant system is used for a business data-processing system that provides on-line sales from a Web-based catalog of products. For each of the three types of spare that could be used in the system (cold, warm, and hot) summarize the major functions that the architectural support software has to perform: (a) while the primary is running; and (b) when the primary fails so as to get the backup functioning. Treat the different types of spare as separate issues, and structure your answer as a set of three separate sections. Hint: Think carefully about how to handle the database, the active clients, the Web server, and whether faults are going to be masked.

There are system-level design issues that are raised by this question. These issues include: (1) how the system’s two halves are connected to the service network; (2) where the two halves are located; (3) how the two halves communicate; (4) whether the various parts listed are handled as separate problems; and (5) the level of repeated activity that will be tolerated by the users. These things are a whole different ballgame.

Cold:

systems, this is fairly simple because the vast majority of the state is the database. Provided the database is protected, the state saved might be as little as details of the set of users logged in, the applications that were running, etc. If the system is not transaction-oriented, then the primary must check periodically that the backup has the same state, and there has to be a time stamp indicating when the state information was saved, i.e., where a restart would start from. To do this, there has to be high-bandwidth, close-to-real-time communications between the two.

(b) Upon failure of the primary, the secondary needs to either start the necessary applications or complete the initialization of partially started applications. Thus, for a transaction-oriented system, the whole process is quite simple. Application processes can be running but suspended on the secondary during normal operation, and so the switchover can start these processes running and point them at the database and the active set of users. Surprisingly, a warm spare system is the easiest to deal with in this scenario.
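The warm-spare behavior described here can be sketched as below. The `WarmSpare` class and its method names are hypothetical. The sketch assumes a transaction-oriented system in which the database is protected separately, so the saved state is only the set of logged-in users, the list of applications, and a time stamp.

```python
class WarmSpare:
    """Hypothetical warm spare: applications are loaded but suspended."""

    def __init__(self):
        self.checkpoint = None   # last state received from the primary
        self.running = False     # application processes start out suspended
        self.database = None     # set only at switchover

    def receive_checkpoint(self, sessions, apps, stamp):
        # (a) Normal operation: keep a copy of the small amount of non-database
        # state, tagged with the time it was saved (where a restart would start from).
        self.checkpoint = {"sessions": set(sessions), "apps": list(apps), "stamp": stamp}

    def take_over(self, database):
        # (b) Primary failed: resume the suspended applications and point them
        # at the database and the active set of users.
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint received from the primary")
        self.database = database
        self.running = True
        return self.checkpoint["stamp"]

spare = WarmSpare()
spare.receive_checkpoint({"alice", "bob"}, ["catalog", "checkout"], stamp=1000)
spare.receive_checkpoint({"alice", "bob", "carol"}, ["catalog", "checkout"], stamp=1005)
print(spare.take_over(database="orders-db"))  # restart point is the latest checkpoint
```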

Hot:

The whole point of a hot spare is to have a mechanism for fast repair, possibly allowing component failures to be masked. If the failure is not masked, then there is every expectation that the repair will be very fast indeed.

(a) During normal operation, the state of the backup has to be kept synchronized with the primary. Thus, again assuming a transaction-processing system, the problem is to keep the backup completely synchronized with the primary yet not interfere with it. This means that the backup has to “think” that it is performing all transactions even though it is not. Software is needed in this case to track every detail of the operating state.

(b) Upon failure, the actions taken by the backup should be simple. All that should be needed is to connect the backup directly with user requests and with the database. Although simple, this recovery/masking activity still requires software, possibly a lot, and that software has to work correctly.
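A minimal sketch of the hot-spare arrangement, under stated assumptions: the `Node` class, the `(session, item)` transaction format, and the `outbox` list standing in for user replies and database commits are all hypothetical. The backup applies every transaction to its own copy of the operating state but suppresses its external effects until it is switched in.

```python
class Node:
    """Hypothetical processing node in a hot-spare pair."""

    def __init__(self, active):
        self.active = active   # only the active node talks to users and the database
        self.state = {}        # operating state, e.g., cart contents per session
        self.outbox = []       # stand-in for replies and database commits

    def apply(self, txn):
        session, item = txn
        self.state.setdefault(session, []).append(item)
        if self.active:
            # The backup skips this step: it "thinks" it performed the
            # transaction, but nothing leaves the machine.
            self.outbox.append(("commit", session, item))

def process(txn, primary, backup):
    # (a) Normal operation: both nodes see every transaction.
    primary.apply(txn)
    backup.apply(txn)

def fail_over(backup):
    # (b) Upon failure: connect the backup to user requests and the database.
    backup.active = True

primary, backup = Node(active=True), Node(active=False)
process(("alice", "book"), primary, backup)
process(("bob", "lamp"), primary, backup)
print(primary.state == backup.state, backup.outbox)

fail_over(backup)
backup.apply(("alice", "pen"))   # backup now serves requests directly
print(backup.outbox)
```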