FAULT TOLERANT SYSTEMS
Israel Koren
C. Mani Krishna
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann Publishers is an imprint of Elsevier
Publisher: Denise Penrose
Publishing Services Manager: George Morrison
Production Editor: Dawnmarie Simpson
Assistant Editor: Kimberlee Honso
Cover Design: Alisa Andreola
Cover Illustration: Yaron Koren
Text Design: Gene Harris
Composition: VTEX
Copyeditor: Graphic World Publishing Services
Proofreader: Graphic World Publishing Services
Indexer: Graphic World Publishing Services
Interior printer: The Maple–Vail Book Manufacturing Group
Cover printer: Phoenix Color, Inc.
Morgan Kaufmann Publishers is an imprint of Elsevier. 500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
© 2007, Elsevier, Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: [email protected]. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Koren, Israel, 1945-
Fault tolerant systems / Israel Koren, C. Mani Krishna.
p. cm.
Includes bibliographical references and index.
ISBN 0-12-088525-5 (alk. paper)
1. Fault-tolerant computing. 2. Computer systems–Reliability. I. Krishna, C. M. II. Title.
QA76.9.F38K67 2007
004.2–dc22 2006031810
ISBN 13: 978-0-12-088568-
ISBN 10: 0-12-088568-
For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com
Printed in the United States
06 07 08 09 10 5 4 3 2 1
- About the Authors
- 1 Preliminaries
- 1.1 Fault Classification
- 1.2 Types of Redundancy
- 1.3 Basic Measures of Fault Tolerance
- 1.3.1 Traditional Measures
- 1.3.2 Network Measures
- 1.4 Outline of This Book
- 1.5 Further Reading
- 2 Hardware Fault Tolerance
- 2.1 The Rate of Hardware Failures
- 2.2 Failure Rate, Reliability, and Mean Time to Failure
- 2.3 Canonical and Resilient Structures
- 2.3.1 Series and Parallel Systems
- 2.3.2 Non-Series/Parallel Systems
- 2.3.3 M-of-N Systems
- 2.3.4 Voters
- 2.3.5 Variations on N-Modular Redundancy
- 2.3.6 Duplex Systems
- 2.4 Other Reliability Evaluation Techniques
- 2.4.1 Poisson Processes
- 2.4.2 Markov Models
- 2.5 Fault-Tolerance Processor-Level Techniques
- 2.5.1 Watchdog Processor
- 2.5.2 Simultaneous Multithreading for Fault Tolerance
- 2.6 Byzantine Failures
- 2.6.1 Byzantine Agreement with Message Authentication
- 2.7 Further Reading
- 2.8 Exercises
- 3 Information Redundancy
- 3.1 Coding
- 3.1.1 Parity Codes
- 3.1.2 Checksum
- 3.1.3 M-of-N Codes
- 3.1.4 Berger Code
- 3.1.5 Cyclic Codes
- 3.1.6 Arithmetic Codes
- 3.2 Resilient Disk Systems
- 3.2.1 RAID Level 1
- 3.2.2 RAID Level 2
- 3.2.3 RAID Level 3
- 3.2.4 RAID Level 4
- 3.2.5 RAID Level 5
- 3.2.6 Modeling Correlated Failures
- 3.3 Data Replication
- 3.3.1 Voting: Non-Hierarchical Organization
- 3.3.2 Voting: Hierarchical Organization
- 3.3.3 Primary-Backup Approach
- 3.4 Algorithm-Based Fault Tolerance
- 3.5 Further Reading
- 3.6 Exercises
- 4 Fault-Tolerant Networks
- 4.1 Measures of Resilience
- 4.1.1 Graph-Theoretical Measures
- 4.1.2 Computer Networks Measures
- 4.2 Common Network Topologies and Their Resilience
- 4.2.1 Multistage and Extra-Stage Networks
- 4.2.2 Crossbar Networks
- 4.2.3 Rectangular Mesh and Interstitial Mesh
- 4.2.4 Hypercube Network
- 4.2.5 Cube-Connected Cycles Networks
- 4.2.6 Loop Networks
- 4.2.7 Ad hoc Point-to-Point Networks
- 4.3 Fault-Tolerant Routing
- 4.3.1 Hypercube Fault-Tolerant Routing
- 4.3.2 Origin-Based Routing in the Mesh
- 4.4 Further Reading
- 4.5 Exercises
- 5 Software Fault Tolerance
- 5.1 Acceptance Tests
- 5.2 Single-Version Fault Tolerance
- 5.2.1 Wrappers
- 5.2.2 Software Rejuvenation
- 5.2.3 Data Diversity
- 5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT)
- 5.3 N-Version Programming
- 5.3.1 Consistent Comparison Problem
- 5.3.2 Version Independence
- 5.4 Recovery Block Approach
- 5.4.1 Basic Principles
- 5.4.2 Success Probability Calculation
- 5.4.3 Distributed Recovery Blocks
- 5.5 Preconditions, Postconditions, and Assertions
- 5.6 Exception-Handling
- 5.6.1 Requirements from Exception-Handlers
- 5.6.2 Basics of Exceptions and Exception-Handling
- 5.6.3 Language Support
- 5.7 Software Reliability Models
- 5.7.1 Jelinski–Moranda Model
- 5.7.2 Littlewood–Verrall Model
- 5.7.3 Musa–Okumoto Model
- 5.7.4 Model Selection and Parameter Estimation
- 5.8 Fault-Tolerant Remote Procedure Calls
- 5.8.1 Primary-Backup Approach
- 5.8.2 The Circus Approach
- 5.9 Further Reading
- 5.10 Exercises
- References
- 6 Checkpointing
- 6.1 What is Checkpointing?
- 6.1.1 Why is Checkpointing Nontrivial?
- 6.2 Checkpoint Level
- 6.3 Optimal Checkpointing—An Analytical Model
- 6.3.1 Time Between Checkpoints—A First-Order Approximation
- 6.3.2 Optimal Checkpoint Placement
- 6.3.3 Time Between Checkpoints—A More Accurate Model
- 6.3.4 Reducing Overhead
- 6.3.5 Reducing Latency
- 6.4 Cache-Aided Rollback Error Recovery (CARER)
- 6.5 Checkpointing in Distributed Systems
- 6.5.1 The Domino Effect and Livelock
- 6.5.2 A Coordinated Checkpointing Algorithm
- 6.5.3 Time-Based Synchronization
- 6.5.4 Diskless Checkpointing
- 6.5.5 Message Logging
- 6.6 Checkpointing in Shared-Memory Systems
- 6.6.1 Bus-Based Coherence Protocol
- 6.6.2 Directory-Based Protocol
- 6.7 Checkpointing in Real-Time Systems
- 6.8 Other Uses of Checkpointing
- 6.9 Further Reading
- 6.10 Exercises
- References
- 7 Case Studies
- 7.1 NonStop Systems
- 7.1.1 Architecture
- 7.1.2 Maintenance and Repair Aids
- 7.1.3 Software
- 7.1.4 Modifications to the NonStop Architecture
- 7.2 Stratus Systems
- 7.3 Cassini Command and Data Subsystem
- 7.4 IBM G5
- 7.5 IBM Sysplex
- 7.6 Itanium
- 7.7 Further Reading
- 8 Defect Tolerance in VLSI Circuits
- 8.1 Manufacturing Defects and Circuit Faults
- 8.2 Probability of Failure and Critical Area
- 8.3 Basic Yield Models
- 8.3.1 The Poisson and Compound Poisson Yield Models
- 8.3.2 Variations on the Simple Yield Models
- 8.4 Yield Enhancement Through Redundancy
- 8.4.1 Yield Projection for Chips with Redundancy
- 8.4.2 Memory Arrays with Redundancy
- 8.4.3 Logic Integrated Circuits with Redundancy
- 8.4.4 Modifying the Floorplan
- 8.5 Further Reading
- 8.6 Exercises
- References
- 9 Fault Detection in Cryptographic Systems
- 9.1 Overview of Ciphers
- 9.1.1 Symmetric Key Ciphers
- 9.1.2 Public Key Ciphers
- 9.2 Security Attacks Through Fault Injection
- 9.2.1 Fault Attacks on Symmetric Key Ciphers
- 9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers
- 9.3 Countermeasures
- 9.3.1 Spatial and Temporal Duplication
- 9.3.2 Error-Detecting Codes
- 9.3.3 Are These Countermeasures Sufficient?
- 9.3.4 Final Comment
- 9.4 Further Reading
- 9.5 Exercises
- References
- 10 Simulation Techniques
- 10.1 Writing a Simulation Program
- 10.2 Parameter Estimation
- 10.2.1 Point Versus Interval Estimation
- 10.2.2 Method of Moments
- 10.2.3 Method of Maximum Likelihood
- 10.2.4 The Bayesian Approach to Parameter Estimation
- 10.2.5 Confidence Intervals
- 10.3 Variance Reduction Methods
- 10.3.1 Antithetic Variables
- 10.3.2 Using Control Variables
- 10.3.3 Stratified Sampling
- 10.3.4 Importance Sampling
- 10.4 Random Number Generation
- 10.4.1 Uniformly Distributed Random Number Generators
- 10.4.2 Testing Uniform Random Number Generators
- 10.4.3 Generating Other Distributions
- 10.5 Fault Injection
- 10.5.1 Types of Fault Injection Techniques
- 10.5.2 Fault Injection Application and Tools
- 10.6 Further Reading
- 10.7 Exercises
- Subject Index
Preface
The purpose of this book is to provide a solid introduction to the rich field of fault-
tolerant computing. Its intended use is as a text for senior-level undergraduate and
first-year graduate students, as well as a reference for practicing engineers in the
industry. Since it would be impossible to cover in one book all the fault-tolerance
techniques and practices that have been developed or are currently in use, we
have focused on providing the basics of the field and enough background to allow
the reader to access more easily the rapidly expanding fault-tolerance literature.
Readers who are interested in further details should consult the list of references
at the end of each chapter. To understand this book well, the reader should have
a basic knowledge of hardware design and organization, principles of software
development, and probability theory.
The book has 10 chapters; each chapter has a list of relevant references and a
set of exercises. Solutions to the exercises are available online, and the
publisher provides access to them upon request to instructors who adopt this
book as a textbook for their class. PowerPoint slides for instructors are also
available.
The book starts with an outline of preliminaries, in which we provide
introductory information. This is followed by a set of six chapters that form
the core of what we believe should be covered in any introduction to
fault-tolerant systems.
Chapter 2 deals with hardware fault-tolerance; this is the discipline with the
longest history (indeed, the idea of using hardware redundancy for fault-tolerance
goes back to the very pioneers of computing, most notably von Neumann). We also
include in this chapter an introduction to some of the probabilistic tools used in
analyzing reliability measures.
Chapter 3 deals with information redundancy with the main focus on error
detecting and correcting codes. Such codes, like hardware fault-tolerance, go back
a very long way, and were motivated in large measure by the need to counter
errors in information transmission. The same, or similar, techniques are being used
today in other applications as well, principally in contemporary memory circuits.
We have sought to provide a survey of only the more important coding techniques,
[…]
educational tools and simulators that can be of great assistance to the readers
of the book. Elsevier also maintains an instructor website that will house the
solutions for those who adopt this book as a textbook for their class. The
website can be found at http://textbooks.elsevier.com.
About the Authors
Israel Koren is a Professor of Electrical and Computer Engineering at the
University of Massachusetts, Amherst. Previously, he held positions with the
University of California at Santa Barbara, the University of Southern
California at Los Angeles, the Technion at Haifa, Israel, and the University of
California at Berkeley. He received a BSc (1967), an MSc (1970), and a DSc
(1975) in electrical engineering from the Technion in Haifa, Israel. His
research interests include fault-tolerant systems, VLSI yield and reliability,
secure cryptographic systems, and computer arithmetic. He publishes extensively
and has over 200 publications in refereed journals and conferences. He is an
Associate Editor of the IEEE Transactions on VLSI Systems, the VLSI Design
Journal, and the IEEE Computer Architecture Letters. He served as General
Chair, Program Chair and Program Committee member for numerous conferences. He
is the author of the textbook Computer Arithmetic Algorithms, 2nd edition,
A.K. Peters, Ltd., 2002, and an editor and co-author of Defect and
Fault-Tolerance in VLSI Systems, Plenum, 1989. Dr. Koren is a fellow of the
IEEE Computer Society.
C. Mani Krishna is a Professor of Electrical and Computer Engineering at the
University of Massachusetts, Amherst. He received his PhD in Electrical
Engineering from the University of Michigan in 1984. He previously received a
BTech in Electrical Engineering from the Indian Institute of Technology, Delhi,
in 1979, and an MS from the Rensselaer Polytechnic Institute in Troy, NY, in
1980. Since 1984, he has been on the faculty of the Department of Electrical
and Computer Engineering at the University of Massachusetts at Amherst. He has
carried out research in a number of areas: real-time, fault-tolerant, and
distributed systems, sensor networks, and performance evaluation of computer
systems. He coauthored a book, Real-Time Systems, McGraw-Hill, 1997, with
Kang G. Shin. He has also been an editor on volumes of readings in performance
evaluation and real-time systems, and for special issues on real-time systems
of IEEE Computer and the Proceedings of the IEEE.