FAULT TOLERANT SYSTEMS

Israel Koren

C. Mani Krishna

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier

Publisher: Denise Penrose
Publishing Services Manager: George Morrison
Production Editor: Dawnmarie Simpson
Assistant Editor: Kimberlee Honso
Cover Design: Alisa Andreola
Cover Illustration: Yaron Koren
Text Design: Gene Harris
Composition: VTEX
Copyeditor: Graphic World Publishing Services
Proofreader: Graphic World Publishing Services
Indexer: Graphic World Publishing Services
Interior printer: The Maple–Vail Book Manufacturing Group
Cover printer: Phoenix Color, Inc.

Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2007, Elsevier, Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: [email protected]. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data

Koren, Israel, 1945-
Fault tolerant systems / Israel Koren, C. Mani Krishna.
p. cm.
Includes bibliographical references and index.
ISBN 0-12-088525-5 (alk. paper)
1. Fault-tolerant computing. 2. Computer systems–Reliability. I. Krishna, C. M. II. Title.
QA76.9.F38K67 2007
004.2–dc22 2006031810

ISBN 13: 978-0-12-088568-
ISBN 10: 0-12-088568-

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com

Printed in the United States
06 07 08 09 10 5 4 3 2 1

  • About the Authors
  • 1 Preliminaries
    • 1.1 Fault Classification
    • 1.2 Types of Redundancy
    • 1.3 Basic Measures of Fault Tolerance
      • 1.3.1 Traditional Measures
      • 1.3.2 Network Measures
    • 1.4 Outline of This Book
    • 1.5 Further Reading
      • References
  • 2 Hardware Fault Tolerance
    • 2.1 The Rate of Hardware Failures
    • 2.2 Failure Rate, Reliability, and Mean Time to Failure
    • 2.3 Canonical and Resilient Structures
      • 2.3.1 Series and Parallel Systems
      • 2.3.2 Non-Series/Parallel Systems
      • 2.3.3 M-of-N Systems
      • 2.3.4 Voters
      • 2.3.5 Variations on N-Modular Redundancy
      • 2.3.6 Duplex Systems
    • 2.4 Other Reliability Evaluation Techniques
      • 2.4.1 Poisson Processes
      • 2.4.2 Markov Models
    • 2.5 Fault-Tolerance Processor-Level Techniques
      • 2.5.1 Watchdog Processor
      • 2.5.2 Simultaneous Multithreading for Fault Tolerance
    • 2.6 Byzantine Failures
      • 2.6.1 Byzantine Agreement with Message Authentication
    • 2.7 Further Reading
    • 2.8 Exercises
      • References
  • 3 Information Redundancy
    • 3.1 Coding
      • 3.1.1 Parity Codes
      • 3.1.2 Checksum
      • 3.1.3 M-of-N Codes
      • 3.1.4 Berger Code
      • 3.1.5 Cyclic Codes
      • 3.1.6 Arithmetic Codes
    • 3.2 Resilient Disk Systems
      • 3.2.1 RAID Level
      • 3.2.2 RAID Level
      • 3.2.3 RAID Level
      • 3.2.4 RAID Level
      • 3.2.5 RAID Level
      • 3.2.6 Modeling Correlated Failures
    • 3.3 Data Replication
      • 3.3.1 Voting: Non-Hierarchical Organization
      • 3.3.2 Voting: Hierarchical Organization
      • 3.3.3 Primary-Backup Approach
    • 3.4 Algorithm-Based Fault Tolerance
    • 3.5 Further Reading
    • 3.6 Exercises
      • References
  • 4 Fault-Tolerant Networks
    • 4.1 Measures of Resilience
      • 4.1.1 Graph-Theoretical Measures
      • 4.1.2 Computer Networks Measures
    • 4.2 Common Network Topologies and Their Resilience
      • 4.2.1 Multistage and Extra-Stage Networks
      • 4.2.2 Crossbar Networks
      • 4.2.3 Rectangular Mesh and Interstitial Mesh
      • 4.2.4 Hypercube Network
      • 4.2.5 Cube-Connected Cycles Networks
      • 4.2.6 Loop Networks
      • 4.2.7 Ad hoc Point-to-Point Networks
    • 4.3 Fault-Tolerant Routing
      • 4.3.1 Hypercube Fault-Tolerant Routing
      • 4.3.2 Origin-Based Routing in the Mesh
    • 4.4 Further Reading
    • 4.5 Exercises
      • References
  • 5 Software Fault Tolerance
    • 5.1 Acceptance Tests
    • 5.2 Single-Version Fault Tolerance
      • 5.2.1 Wrappers
      • 5.2.2 Software Rejuvenation
      • 5.2.3 Data Diversity
      • 5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT)
    • 5.3 N-Version Programming
      • 5.3.1 Consistent Comparison Problem
      • 5.3.2 Version Independence
    • 5.4 Recovery Block Approach
      • 5.4.1 Basic Principles
      • 5.4.2 Success Probability Calculation
      • 5.4.3 Distributed Recovery Blocks
    • 5.5 Preconditions, Postconditions, and Assertions
    • 5.6 Exception-Handling
      • 5.6.1 Requirements from Exception-Handlers
      • 5.6.2 Basics of Exceptions and Exception-Handling
      • 5.6.3 Language Support
    • 5.7 Software Reliability Models
      • 5.7.1 Jelinski–Moranda Model
      • 5.7.2 Littlewood–Verrall Model
      • 5.7.3 Musa–Okumoto Model
      • 5.7.4 Model Selection and Parameter Estimation
    • 5.8 Fault-Tolerant Remote Procedure Calls
      • 5.8.1 Primary-Backup Approach
      • 5.8.2 The Circus Approach
    • 5.9 Further Reading
    • 5.10 Exercises
      • References
  • 6 Checkpointing
    • 6.1 What is Checkpointing?
      • 6.1.1 Why is Checkpointing Nontrivial?
    • 6.2 Checkpoint Level
    • 6.3 Optimal Checkpointing: An Analytical Model
      • 6.3.1 Time Between Checkpoints: A First-Order Approximation
      • 6.3.2 Optimal Checkpoint Placement
      • 6.3.3 Time Between Checkpoints: A More Accurate Model
      • 6.3.4 Reducing Overhead
      • 6.3.5 Reducing Latency
    • 6.4 Cache-Aided Rollback Error Recovery (CARER)
    • 6.5 Checkpointing in Distributed Systems
      • 6.5.1 The Domino Effect and Livelock
      • 6.5.2 A Coordinated Checkpointing Algorithm
      • 6.5.3 Time-Based Synchronization
      • 6.5.4 Diskless Checkpointing
      • 6.5.5 Message Logging
    • 6.6 Checkpointing in Shared-Memory Systems
      • 6.6.1 Bus-Based Coherence Protocol
      • 6.6.2 Directory-Based Protocol
    • 6.7 Checkpointing in Real-Time Systems
    • 6.8 Other Uses of Checkpointing
    • 6.9 Further Reading
    • 6.10 Exercises
      • References
  • 7 Case Studies
    • 7.1 NonStop Systems
      • 7.1.1 Architecture
      • 7.1.2 Maintenance and Repair Aids
      • 7.1.3 Software
      • 7.1.4 Modifications to the NonStop Architecture
    • 7.2 Stratus Systems
    • 7.3 Cassini Command and Data Subsystem
    • 7.4 IBM G5
    • 7.5 IBM Sysplex
    • 7.6 Itanium
    • 7.7 Further Reading
      • References
  • 8 Defect Tolerance in VLSI Circuits
    • 8.1 Manufacturing Defects and Circuit Faults
    • 8.2 Probability of Failure and Critical Area
    • 8.3 Basic Yield Models
      • 8.3.1 The Poisson and Compound Poisson Yield Models
      • 8.3.2 Variations on the Simple Yield Models
    • 8.4 Yield Enhancement Through Redundancy
      • 8.4.1 Yield Projection for Chips with Redundancy
      • 8.4.2 Memory Arrays with Redundancy
      • 8.4.3 Logic Integrated Circuits with Redundancy
      • 8.4.4 Modifying the Floorplan
    • 8.5 Further Reading
    • 8.6 Exercises
      • References
  • 9 Fault Detection in Cryptographic Systems
    • 9.1 Overview of Ciphers
      • 9.1.1 Symmetric Key Ciphers
      • 9.1.2 Public Key Ciphers
    • 9.2 Security Attacks Through Fault Injection
      • 9.2.1 Fault Attacks on Symmetric Key Ciphers
      • 9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers
    • 9.3 Countermeasures
      • 9.3.1 Spatial and Temporal Duplication
      • 9.3.2 Error-Detecting Codes
      • 9.3.3 Are These Countermeasures Sufficient?
      • 9.3.4 Final Comment
    • 9.4 Further Reading
    • 9.5 Exercises
      • References
  • 10 Simulation Techniques
    • 10.1 Writing a Simulation Program
    • 10.2 Parameter Estimation
      • 10.2.1 Point Versus Interval Estimation
      • 10.2.2 Method of Moments
      • 10.2.3 Method of Maximum Likelihood
      • 10.2.4 The Bayesian Approach to Parameter Estimation
      • 10.2.5 Confidence Intervals
    • 10.3 Variance Reduction Methods
      • 10.3.1 Antithetic Variables
      • 10.3.2 Using Control Variables
      • 10.3.3 Stratified Sampling
      • 10.3.4 Importance Sampling
    • 10.4 Random Number Generation
      • 10.4.1 Uniformly Distributed Random Number Generators
      • 10.4.2 Testing Uniform Random Number Generators
      • 10.4.3 Generating Other Distributions
    • 10.5 Fault Injection
      • 10.5.1 Types of Fault Injection Techniques
      • 10.5.2 Fault Injection Application and Tools
    • 10.6 Further Reading
    • 10.7 Exercises
      • References
  • Subject Index

Preface

The purpose of this book is to provide a solid introduction to the rich field of fault-tolerant computing. Its intended use is as a text for senior-level undergraduate and first-year graduate students, as well as a reference for practicing engineers in the industry. Since it would be impossible to cover in one book all the fault-tolerance techniques and practices that have been developed or are currently in use, we have focused on providing the basics of the field and enough background to allow the reader to access more easily the rapidly expanding fault-tolerance literature. Readers who are interested in further details should consult the list of references at the end of each chapter. To understand this book well, the reader should have a basic knowledge of hardware design and organization, principles of software development, and probability theory.

The book has 10 chapters; each chapter has a list of relevant references and a set of exercises. Solutions to the exercises are available online, and the publisher provides access to them upon request to instructors who adopt this book as a textbook for their class. PowerPoint slides for instructors are also available.

The book starts with an outline of preliminaries, in which we provide introductory information. This is followed by a set of six chapters that form the core of what we believe should be covered in any introduction to fault-tolerant systems.

Chapter 2 deals with hardware fault-tolerance; this is the discipline with the longest history (indeed, the idea of using hardware redundancy for fault-tolerance goes back to the very pioneers of computing, most notably von Neumann). We also include in this chapter an introduction to some of the probabilistic tools used in analyzing reliability measures.

Chapter 3 deals with information redundancy with the main focus on error detecting and correcting codes. Such codes, like hardware fault-tolerance, go back a very long way, and were motivated in large measure by the need to counter errors in information transmission. The same, or similar, techniques are being used today in other applications as well, principally in contemporary memory circuits. We have sought to provide a survey of only the more important coding techniques,

educational tools and simulators that can be of great assistance to the readers of the book. Elsevier also maintains an instructor web site that will house the solutions for those who adopt this book as a textbook for their class. The website can be found at http://textbooks.elsevier.com.

About the Authors

Israel Koren is a Professor of Electrical and Computer Engineering at the University of Massachusetts, Amherst. Previously, he held positions with the University of California at Santa Barbara, the University of Southern California at Los Angeles, the Technion at Haifa, Israel, and the University of California at Berkeley. He received a BSc (1967), an MSc (1970), and a DSc (1975) in electrical engineering from the Technion in Haifa, Israel. His research interests include fault-tolerant systems, VLSI yield and reliability, secure cryptographic systems, and computer arithmetic. He publishes extensively and has over 200 publications in refereed journals and conferences. He is an Associate Editor of the IEEE Transactions on VLSI Systems, the VLSI Design Journal, and the IEEE Computer Architecture Letters. He served as General Chair, Program Chair and Program Committee member for numerous conferences. He is the author of the textbook Computer Arithmetic Algorithms, 2nd edition, A.K. Peters, Ltd., 2002, and an editor and co-author of Defect and Fault-Tolerance in VLSI Systems, Plenum, 1989. Dr. Koren is a fellow of the IEEE Computer Society.

C. Mani Krishna is a Professor of Electrical and Computer Engineering at the University of Massachusetts, Amherst. He received his PhD in Electrical Engineering from the University of Michigan in 1984. He previously received a BTech in Electrical Engineering from the Indian Institute of Technology, Delhi, in 1979, and an MS from the Rensselaer Polytechnic Institute in Troy, NY, in 1980. Since 1984, he has been on the faculty of the Department of Electrical and Computer Engineering at the University of Massachusetts at Amherst. He has carried out research in a number of areas: real-time, fault-tolerant, and distributed systems, sensor networks, and performance evaluation of computer systems. He coauthored a book, Real-Time Systems, McGraw-Hill, 1997, with Kang G. Shin. He has also been an editor on volumes of readings in performance evaluation and real-time systems, and for special issues on real-time systems of IEEE Computer and the Proceedings of the IEEE.
