Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

IEEE Standard 754 for Binary Floating-Point Arithmetic - Lecture Notes | CS 1541, Study notes of Computer Science

University of Pittsburgh (Pitt) - Medical Center-Health System Computer Science

Prof. Sangyeun Cho

Material Type: Notes; Professor: Cho; Class: INTRO TO COMPUTER ARCHITECTURE; Subject: Computer Science; University: University of Pittsburgh; Term: Summer 1996;

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-f3j 🇺🇸

10 documents

1 / 30

This page cannot be seen from the preview

Don't miss anything!

Work in Progress: Lecture Notes on the Status of IEEE 754 May 31, 1996 2:44 pm

Page 1

Lecture Notes on the Status of

IEEE Standard 754 for Binary Floating-Point Arithmetic

Prof. W. Kahan

Elect. Eng. & Computer Science

University of California

Berkeley CA 94720-1776

Introduction:

Twenty years ago anarchy threatened floating-point arithmetic. Over a dozen commercially significant arithmetics

boasted diverse wordsizes, precisions, rounding procedures and over/underflow behaviors, and more were in the

works. “Portable” software intended to reconcile that numerical diversity had become unbearably costly to

develop.

Eleven years ago, when IEEE 754 became official, major microprocessor manufacturers had already adopted it

despite the challenge it posed to implementors. With unprecedented altruism, hardware designers rose to its

challenge in the belief that they would ease and encourage a vast burgeoning of numerical software. They did

succeed to a considerable extent. Anyway, rounding anomalies that preoccupied all of us in the 1970s afflict only

CRAYs now.

Now atrophy threatens features of IEEE 754 caught in a vicious circle:

Those features lack support in programming languages and compilers,

so those features are mishandled and/or practically unusable,

so those features are little known and less in demand, and so

those features lack support in programming languages and compilers.

To help break that circle, those features are discussed in these notes under the following headings:

Representable Numbers, Normal and Subnormal, Infinite and NaN 2

Encodings, Span and Precision 3-4

Multiply-Accumulate, a Mixed Blessing 5

Exceptions in General; Retrospective Diagnostics 6

Exception: Invalid Operation; NaNs 7

Exception: Divide by Zero; Infinities 10

Digression on Division by Zero; Two Examples 10

Exception: Overflow 14

Exception: Underflow 15

Digression on Gradual Underflow; an Example 16

Exception: Inexact 18

Directions of Rounding 18

Precisions of Rounding 19

The Baleful Influence of Benchmarks; a Proposed Benchmark 20

Exceptions in General, Reconsidered; a Suggested Scheme 23

Ruminations on Programming Languages 29

Annotated Bibliography 30

Insofar as this is a status report, it is subject to change and supersedes versions with earlier dates. This version

supersedes one distributed at a panel discussion of “Floating-Point Past, Present and Future” in a series of San

Francisco Bay Area Computer History Perspectives sponsored by Sun Microsystems Inc. in May 1995. A Post-

Script version is accessible electronically as http://http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps .

This document was created with FrameMaker404

Discover Study notes of Computer Science University of Pittsburgh (Pitt) - Medical Center-Health System

Partial preview of the text

Download IEEE Standard 754 for Binary Floating-Point Arithmetic - Lecture Notes | CS 1541 and more Study notes Computer Science in PDF only on Docsity!

Page 1

Lecture Notes on the Status of

IEEE Standard 754 for Binary Floating-Point Arithmetic

Prof. W. Kahan Elect. Eng. & Computer Science University of California Berkeley CA 94720-

Introduction:

Twenty years ago anarchy threatened floating-point arithmetic. Over a dozen commercially significant arithmetics boasted diverse wordsizes, precisions, rounding procedures and over/underflow behaviors, and more were in the works. “Portable” software intended to reconcile that numerical diversity had become unbearably costly to develop.

Eleven years ago, when IEEE 754 became official, major microprocessor manufacturers had already adopted it despite the challenge it posed to implementors. With unprecedented altruism, hardware designers rose to its challenge in the belief that they would ease and encourage a vast burgeoning of numerical software. They did succeed to a considerable extent. Anyway, rounding anomalies that preoccupied all of us in the 1970s afflict only CRAYs now.

Now atrophy threatens features of IEEE 754 caught in a vicious circle: Those features lack support in programming languages and compilers, so those features are mishandled and/or practically unusable, so those features are little known and less in demand, and so those features lack support in programming languages and compilers. To help break that circle, those features are discussed in these notes under the following headings:

Representable Numbers, Normal and Subnormal, Infinite and NaN 2 Encodings, Span and Precision 3- Multiply-Accumulate, a Mixed Blessing 5 Exceptions in General; Retrospective Diagnostics 6 Exception: Invalid Operation; NaNs 7 Exception: Divide by Zero; Infinities 10 Digression on Division by Zero; Two Examples 10

Exception: Overflow 14 Exception: Underflow 15 Digression on Gradual Underflow; an Example 16 Exception: Inexact 18 Directions of Rounding 18 Precisions of Rounding 19 The Baleful Influence of Benchmarks; a Proposed Benchmark 20

Exceptions in General, Reconsidered; a Suggested Scheme 23 Ruminations on Programming Languages 29 Annotated Bibliography 30

Insofar as this is a status report, it is subject to change and supersedes versions with earlier dates. This version supersedes one distributed at a panel discussion of “Floating-Point Past, Present and Future” in a series of San Francisco Bay Area Computer History Perspectives sponsored by Sun Microsystems Inc. in May 1995. A Post- Script version is accessible electronically as http://http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps.

This document was created with FrameMaker 4 0 4

Representable Numbers:

IEEE 754 specifies three types or Formats of floating-point numbers:

Single ( Fortran's REAL4, C's float ), ( Obligatory ), Double ( Fortran's REAL8, C's double ), ( Ubiquitous ), and Double-Extended ( Fortran REAL*10+, C's long double ), ( Optional ). ( A fourth Quadruple-Precision format is not specified by IEEE 754 but has become a de facto standard among several computer makers none of whom support it fully in hardware yet, so it runs slowly at best.)

Each format has representations for NaNs (Not-a-Number), ±∞ (Infinity), and its own set of finite real numbers all of the simple form

2 k+1-N^ n with two integers n ( signed Significand ) and k ( unbiased signed Exponent ) that run throughout two intervals determined from the format thus:

K+1 Exponent bits: 1 - 2K^ < k < 2K^. N Significant bits: -2N^ < n < 2N^.

This concise representation 2 k+1-N^ n , unique to IEEE 754, is deceptively simple. At first sight it appears potentially ambiguous because, if n is even, dividing n by 2 ( a right-shift ) and then adding 1 to k makes no difference. Whenever such an ambiguity could arise it is resolved by minimizing the exponent k and thereby maximizing the magnitude of significand n ; this is “ Normalization ” which, if it succeeds, permits a Normal

nonzero number to be expressed in the form 2 k+1-N^ n = ± 2 k^ ( 1 + f ) with a nonnegative fraction f < 1.

Besides these Normal numbers, IEEE 754 has Subnormal ( Denormalized ) numbers lacking or suppressed in earlier computer arithmetics; Subnormals, which permit Underflow to be Gradual, are nonzero numbers with an unnormalized significand n and the same minimal exponent k as is used for 0 :

Subnormal 2 k+1-N^ n = ± 2 k^ (0 + f ) has k = 2 - 2K^ and 0 < | n | < 2N-1^ , so 0 < f < 1.

Thus, where earlier arithmetics had conspicuous gaps between 0 and the tiniest Normal numbers ± 2 2-

K , IEEE 754 fills the gaps with Subnormals spaced the same distance apart as the smallest Normal numbers:

Subnormals [--- Normalized Numbers ----- - - - - - - - - - -> | | | 0-!-!-+-!-+-+-+-!-+-+-+-+-+-+-+-!---+---+---+---+---+---+---+---!------ - - | | | | | |

Powers of 2 : 2 2-

K 2 3-

K 2 4-

-+- Consecutive Positive Floating-Point Numbers -+-

Table of Formats’ Parameters:

Format Bytes K+1 N

Single 4 8 24

Double 8 11 53

Double-Extended ≥ 10 ≥ 15 ≥ 64

( Quadruple 16 15 113 )

The following table exhibits the span of each floating-point format, and its precision both as an upper bound 2 -N upon relative error ß and in “ Significant Decimals.”

Span and Precision of IEEE 754 Floating-Point Formats :

Entries in this table come from the following formulas:

Min. Positive Subnormal: 2 3 - 2K - N Min. Positive Normal: 2 2 - 2K Max. Finite: (1 - 1/2N) 2 2K

Sig. Dec., at least: floor( (N-1) Log 10 (2) ) sig. dec. at most: ceil( 1 + N Log 10 (2) ) sig. dec.

The precision is bracketed within a range in order to characterize how accurately conversion between binary and decimal has to be implemented to conform to IEEE 754. For instance, “ 6 - 9 ” Sig. Dec. for Single means that, in the absence of OVER/UNDERFLOW, ...

If a decimal string with at most 6 sig. dec. is converted to Single and then converted back to the same number of sig. dec., then the final string should match the original. Also, ...

If a Single Precision floating-point number is converted to a decimal string with at least 9 sig. dec. and then converted back to Single, then the final number must match the original.

Most microprocessors that support floating-point on-chip, and all that serve in prestigious workstations, support just the two REAL4 and REAL8 floating-point formats. In some cases the registers are all 8 bytes wide, and REAL4 operands are converted on the fly to their REAL8 equivalents when they are loaded into a register; in such cases, immediately rounding to REAL4 every REAL8 result of an operation upon such converted operands produces the same result as if the operation had been performed in the REAL*4 format all the way.

But Motorola 680x0-based Macintoshes and Intel ix86-based PCs with ix87-based ( not Weitek’s 1167 or 3167 ) floating-point behave quite differently; they perform all arithmetic operations in the Extended format, regardless of the operands’ widths in memory, and round to whatever precision is called for by the setting of a control word.

Only the Extended format appears in a 680x0’s eight floating-point flat registers or an ix87’s eight floating- point stack-registers, so all numbers loaded from memory in any other format, floating-point or integer or BCD, are converted on the fly into Extended with no change in value. All arithmetic operations enjoy the Extended range and precision. Values stored from a register into a narrower memory format get rounded on the fly, and may also incur OVER/UNDERFLOW. ( Since the register’s value remains unchanged, unless popped off the ix87’s stack, misconstrued ambiguities in manuals or ill-considered “ optimizations ” cause some compilers sometimes wrongly to reuse that register’s value in place of what was stored from it; this subtle bug will be re-examined later under " Precisions of Rounding " below.)

Since the Extended format is optional in implementations of IEEE 754, most chips do not offer it; it is available only on Intel’s x86/x87, Pentium, P6 and their clones by AMD and Cyrix, on Intel’s 80960 KB, on Motorola’s 68040/60 or earlier 680x0 with 68881/2 coprocessor, and on Motorola’s 88110, all with 64 sig.

Format Min. Subnormal Min. Normal Max. Finite 2 -N^ Sig. Dec.

Single: 1.4 E-45 1.2 E-38 3.4 E38 5.96 E-8 6 - 9 Double: 4.9 E-324 2.2 E-308 1.8 E308 1.11 E-16 15 - 17 Extended: ≤ 3.6 E-4951 ≤ 3.4 E-4932 ≥ 1.2 E4932 ≤ 5.42 E-20 ≥ 18 - 21 ( Quadruple: 6.5 E-4966 3.4 E-4932 1.2 E4932 9.63 E-35 33 - 36 )

bits and 15 bits of exponent, but in words that may be 80 or 96 or 128 bits wide when stored in memory. This format is intended mainly to help programmers enhance the integrity of their Single and Double software, and to attenuate degradation by roundoff in Double matrix computations of larger dimensions, and can easily be used in such a way that substituting Quadruple for Extended need never invalidate its use. However, language support for Extended is hard to find.

Multiply-Accumulate , a Mixed Blessing:

The IBM Power PC and Apple Power Macintosh, both derived from the IBM RS/6000 architecture, purport to conform to IEEE 754 but too often use a “ Fused ” Multiply-Add instruction in a non-conforming way. The idea behind a Multiply-Add ( or “ MAC ” for “ Multiply-Accumulate ” ) instruction is that an expression like ±a*b ± c be evaluated in one instruction so implemented that scalar products like a 1 *b 1 + a 2 *b 2 + a 3 b 3 + ... + aLbL

can be evaluated in about L+3 machine cycles. Many machines have a MAC. Beyond that, a Fused MAC evaluates ±a*b ± c with just one rounding error at the end. This is done not so much to roughly halve the rounding errors in a scalar product as to facilitate fast and correctly rounded division without much hardware dedicated to it.

To compute q = x/y correctly rounded, it suffices to have hardware approximate the reciprocal 1/y to several sig. bits by a value t looked up in a table, and then improve t by iteration thus: t := t + (1 - ty)t. Each such iteration doubles the number of correct bits in t at the cost of two MACs until t is accurate enough to produce q := tx. To round q correctly, its remainder r := x - qy must be obtained exactly; this is what the “ Fused ” in the Fused MAC is for. It also speeds up correctly rounded square root, decimal <-> binary conversion, and some transcendental functions. These and other uses make a Fused MAC worth putting into a computer's instruction set. ( If only division and square root were at stake we might do better merely to widen the multiplier hardware slightly in a way accessible solely to microcode, as TI does in its SPARC chips.)

A Fused MAC also speeds up a grubby “Doubled-Double” approximation to Quadruple-Precision arithmetic by unevaluated sums of pairs of Doubles. Its advantage comes about from a Fused MAC's ability to evaluate any product ab exactly; first let p := ab rounded off; then compute c := ab - p exactly in another Fused MAC, so that ab = p + c exactly without roundoff. Fast but grubby Double-Double undermines the incentive to provide Quadruple-Precision correctly rounded in IEEE 754's style.

Fused MACs generate anomalies when used to evaluate ab ± cd in two instructions instead of three. Which of ab and cd is evaluated and therefore rounded first? Either way, important expectations can be thwarted. For example, multiplying a complex number by its complex conjugate should produce a real number, but it might not

with a Fused MAC. If √( q^2 - pr ) is real in the absence of roundoff, then the same is expected for SQRT( qq - p*r ) despite roundoff, but perhaps not with a Fused MAC. Therefore Fused MACs cannot be used indiscriminately; there are a few programs that contain a few assignment statements from which Fused MACs must be banned.

By design, a Fused MAC always runs faster than separate multiplication and add, so compiler writers with one eye on benchmarks based solely upon speed leave programmers no way to inhibit Fused MACs selectively within expressions, nor to ban them from a selected assignment statement.

Ideally, some locution like redundant parentheses should be understood to control the use of Fused MACs on machines that have them. For instance, in Fortran, ... (AB) + CD and CD + (AB) should always round AB first; (AB) + (C*D) should inhibit the use of a Fused MAC here. Something else is needed for C , whose Macro Preprocessor often insinuates hordes of redundant parentheses. Whatever expedient is chosen must have no effect upon compilations to machines that lack a Fused MAC; a separate compiler directive at the beginning of a program should say whether the program is intended solely for machines with, or solely for machines without a Fused MAC.

Exception: INVALID operation.

Signaled by the raising of the INVALID flag whenever an operation's operands lie outside its domain, this exception's default, delivered only because any other real or infinite value would most likely cause worse confusion, is NaN , which means “ Not a Number.” IEEE 754 specifies that seven invalid arithmetic operations shall deliver a NaN unless they are trapped: real √(Negative) , 0*∞ , 0.0/0.0 , ∞/∞, REMAINDER(Anything, 0.0) , REMAINDER( ∞, Anything ) , ∞ - ∞ when signs agree ( but ∞ + ∞ = ∞ when signs agree ). Conversion from floating-point to other formats can be INVALID too, if their limits are violated, even if no NaN can be delivered.

NaN also means “ Not any Number ” ; NaN does not represent the set of all real numbers, which is an interval for which the appropriate representation is provided by a scheme called “ Interval Arithmetic.”

NaN must not be confused with “ Undefined.” On the contrary, IEEE 754 defines NaN perfectly well even though most language standards ignore and many compilers deviate from that definition. The deviations usually afflict relational expressions, discussed below. Arithmetic operations upon NaNs other than SNaNs ( see below ) never signal INVALID, and always produce NaN unless replacing every NaN operand by any finite or infinite real values would produce the same finite or infinite floating-point result independent of the replacements.

For example, 0NaN must be NaN because 0∞ is an INVALID operation ( NaN ). On the other hand, for hypot(x, y) := √(xx + yy) we find that hypot(∞, y) = +∞ for all real y , finite or not, and deduce that hypot(∞, NaN) = +∞ too; naive implementations of hypot may do differently.

NaNs were not invented out of whole cloth. Konrad Zuse tried similar ideas in the late 1930s; Seymour Cray built “ Indefinites ” into the CDC 6600 in 1963; then DEC put “ Reserved Operands ” into their PDP- and VAX. But nobody used them because they trap when touched. NaNs do not trap ( unless they are “ Signaling ” SNaNs, which exist mainly for political reasons and are rarely used ); NaNs propagate through most computations. Consequently they do get used.

Perhaps NaNs are widely misunderstood because they are not needed for mathematical analysis, whose sequencing is entirely logical; they are needed only for computation, with temporal sequencing that can be hard to revise, harder to reverse. NaNs must conform to mathematically consistent rules that were deduced, not invented arbitrarily, in 1977 during the design of the Intel 8087 that preceded IEEE 754. What had been missing from computation but is now supplied by NaNs is an opportunity ( not obligation ) for software ( especially when searching ) to follow an unexceptional path ( no need for exotic control structures ) to a point where an exceptional event can be appraised after the event, when additional evidence may have accrued. Deferred judgments are usually better judgments but not always , alas.

Whenever a NaN is created from non-NaN operands, IEEE 754 demands that the INVALID OPERATION flag be raised, but does not say whether a flag is a word in memory or a bit in a hardware Status Word. That flag stays raised until the program lowers it. ( The Motorola 680x0 also raises or lowers a transient flag that pertains solely to the last floating-point operation executed.) The “ Sticky ” flag mandated by IEEE 754 allows programmers to test it later at a convenient place to detect previous INVALID operations and compensate for them, rather than be forced to prevent them. However, ... +-------------------------------------------------------------------+ | Microsoft's C and C++ compilers defeat that purpose of the | | INVALID flag by using it exclusively to detect floating-point | | stack overflows, so programmers cannot use it ( via library | | functions _clear87 and _status87 ) for their own purposes. | +-------------------------------------------------------------------+ This flagrant violation of IEEE 754 appears not to weigh on Microsoft’s corporate conscience. So far as I know, Borland's C , C++ and Pascal compilers do not abuse the INVALID flag that way.

............................

While on the subject of miscreant compilers, we should remark their increasingly common tendency to reorder operations that can be executed concurrently by pipelined computers. C programmers may declare a variable volatile to inhibit certain reorderings. A programmer's intention is thwarted when an alleged “ optimization ” moves a floating-point instruction past a procedure-call intended to deal with a flag in the floating-point status word or to write into the control word to alter trapping or rounding. Bad moves like these have been made even by compilers that come supplied with such procedures in their libraries. ( See _control87 , _clear87 and _status in compilers for Intel processors.) Operations’ movements would be easier to debug if they were highlighted by the compiler in its annotated re-listing of the source-code. Meanwhile, so long as compilers mishandle attempts to cope with floating-point exceptions, flags and modes in the ways intended by IEEE Standard 754, frustrated programmers will abandon such attempts and compiler writers will infer wrongly that unexercised capabilities are unexercised for lack of demand.................................

IEEE 754's specification for NaN endows it with a field of bits into which software can record, say, how and/or where the NaN came into existence. That information would be extremely helpful for subsequent “ Retrospective Diagnosis ” of malfunctioning computations, but no software exists now to employ it. Customarily that field has been copied from an operand NaN to the result NaN of every arithmetic operation, or filled with binary 1000...000 when a new NaN was created by an untrapped INVALID operation. For lack of software to exploit it, that custom has been atrophying.

680x0 and ix87 treat a NaN with any nonzero binary 0xxx...xxx in that field as an SNaN ( Signaling NaN ) to fulfill a requirement of IEEE 754. An SNaN may be moved ( copied ) without incident, but any other arithmetic operation upon an SNaN is an INVALID operation ( and so is loading one onto the ix87's stack ) that must trap or else produce a new nonsignaling NaN. ( Another way to turn an SNaN into a NaN is to turn 0xxx...xxx into 1xxx...xxx with a logical OR.) Intended for, among other things, data missing from statistical collections, and for uninitialized variables, SNaNs seem preferable for such purposes to zeros or haphazard traces left in memory by a previous program. However, no more will be said about SNaNs here.

Were there no way to get rid of NaNs, they would be as useless as Indefinites on CRAYs; as soon as one were encountered, computation would be best stopped rather than continued for an indefinite time to an Indefinite conclusion. That is why some operations upon NaNs must deliver non-NaN results. Which operations?

Disagreements about some of them are inevitable, but that grants no license to resolve the disagreements by making arbitrary choices. Every real ( not logical ) function that produces the same floating-point result for all finite and infinite numerical values of an argument should yield the same result when that argument is NaN. ( Recall hypot above.)

The exceptions are C predicates “ x == x ” and “ x != x ”, which are respectively 1 and 0 for every infinite or finite number x but reverse if x is Not a Number ( NaN ); these provide the only simple unexceptional distinction between NaNs and numbers in languages that lack a word for NaN and a predicate IsNaN(x). Over- optimizing compilers that substitute 1 for x == x violate IEEE 754.

IEEE 754 assigns values to all relational expressions involving NaN. In the syntax of C , the predicate x != y is True but all others, x < y , x <= y , x == y , x >= y and x > y , are False whenever x or y or both are NaN, and then all but x != y and x == y are INVALID operations too and must so signal. Ideally, expressions x !< y , x !<= y , x !>= y , x !> y and x !>=< y should be valid and quietly True if x or y or both are NaN , but arbiters of taste and fashion for ANSI Standard C have refused to recognize such expressions. In any event, !(x < y) differs from x >= y when NaN is involved, though rude compilers “ optimize ” the difference away. Worse, some compilers mishandle NaNs in all relational expressions.

Some language standards conflict with IEEE 754. For example, APL specifies 1.0 for 0.0/0.0 ; this specification is one that APL's designers soon regretted. Sometimes naive compile-time optimizations replace expressions x/x by 1 ( wrong if x is zero, ∞ or NaN ) and x - x by 0 ( wrong if x is ∞ or NaN ) and 0*x and 0/x by 0 ( wrong if ... ), alas.

Exception: DIVIDE by ZERO.

This is a misnomer perpetrated for historical reasons. A better name for this exception is “ Infinite result computed Exactly from Finite operands. ” An example is 3.0/0.0 , for which IEEE 754 specifies an Infinity as the default result. The sign bit of that result is, as usual for quotients, the exclusive OR of the operands' sign bits. Since 0.0 can have either sign, so can ∞; in fact, division by zero is the only algebraic operation that reveals the sign of zero. ( IEEE 754 recommends a non-algebraic function CopySign to reveal a sign without ever signaling an exception, but few compilers offer it, alas.)

Ideally, certain other real expressions should be treated just the way IEEE 754 treats divisions by zero, rather than all be misclassified as errors or “ Undefined ”; some examples in Fortran syntax are ... (±0.0)(NegativeNonInteger) = +∞ , (±0.0)(NegativeEvenInteger) = +∞ , (±0.0)**(NegativeOddInteger) = ±∞ resp., ATANH(±1.0) = ±∞ resp., LOG(±0.0) = -∞.

The sign of ∞ may be accidental in some cases; for instance, if TANdeg(x) delivers the TAN of an angle x measured in degrees, then TANdeg(90.0 + 180*Integer) is infinite with a sign that depends upon details of the implementation. Perhaps that sign might best match the sign of the argument, but no such convention exists yet. ( For x in radians, accurately implemented TAN(x) need never be infinite! )

Compilers can cause accidents by evaluating expressions carelessly. For example, when y resides in a register, evaluating x-y as -(y-x) reverses the sign of zero if y = x ; evaluate it as -y + x instead. Simplifying x+ to x misbehaves when x is -0. Doing that, or printing -0 without its sign, can obscure the source of a -∞.

Operations that produce an infinite result from an infinite operand or two must not signal DIVIDE by ZERO. Examples include ∞ + 3 , ∞*∞ , EXP(+∞) , LOG(+∞) , .... Neither can 3.0/∞ = EXP(-∞) = 0.0 , ATAN(±∞) = ±π/2 , and similar examples be regarded as exceptional. Unfortunately, naive implementations of complex arithmetic can

render ∞ dangerous; for instance, when (0 + 3 ı )/∞ is turned naively into (0 + 3 ı )(∞ - ı 0)/(∞^2 + 0^2 ) it generates a NaN instead of the expected 0 ; MATLAB suffers from this affliction.

If all goes well, infinite intermediate results will turn quietly into correct finite final results that way. If all does not go well, Infinity will turn into NaN and signal INVALID. Unlike integer division by zero, for which no integer infinity nor NaN has been provided, floating-point division by zero poses no danger provided subsequent INVALID signals, if any, are heeded; in that case disabling the trap for DIVIDE by ZERO is quite safe.

...... Digression on Division-by-Zero ......

Schools teach us to abhor Division-by-Zero and to stand in awe of the Infinite. Actually, adjoining Infinity to the real numbers adds nothing worse than another exception to the familiar cancellation laws (1/x)x = x/x = 1 , x-x = 0 , among which the first is already violated by x = 0. That is a small inconvenience compared with the circumlocutions we would resort to if Infinity were outlawed. Two examples to show why are offered below.

The first example shows how Infinity eases the numerical solution of a differential equation that appears to have

no divisions in it. The problem is to compute y(10) where y(t) satisfies the Ricatti equation

dy/dt = t + y^2 for all t ≥ 0 , y(0) = 0.

Let us pretend not to know that y(t) may be expressed in terms of Bessel functions J... , whence

y(10) = -7.53121 10731 35425 34544 97349 58···. Instead a numerical method will be used to solve the differential equation approximately and as accurately as desired if enough time is spent on it.

Q(θ, t, Y) will stand for an Updating Formula that advances from any estimate Y ≈ y(t) to a later estimate

Q(θ, t, Y) ≈ y(t+θ). Vastly many updating formulas exist; the simplest that might be applied to solve the given

Ricatti equation would be Euler's formula:

Q(θ, t, Y) := Y + θ·(t + Y^2 ).

This “ First-Order ” formula converges far too slowly as stepsize θ shrinks; a faster “ Second-Order ” formula, of Runge-Kutta type, is Heun's :

f := t + Y^2 ; q := Y + θ·f ;

Q(θ, t, Y) := Y + ( f + t+θ + q 2 )·θ/.

Formulas like these are used widely to solve practically all ordinary differential equations. Every updating formula is intended to be iterated with a sequence of stepsizes θ that add up to the distance to be covered; for instance, Q(...) may be iterated N times with constant stepsize θ := 10/N to produce Y(n·θ) ≈ y(n·θ) thus:

Y(0) := y(0) ; for n = 1 to N do Y(n·θ) := Q( θ, (n-1)·θ, Y((n-1)·θ) ).

Here the number N of timesteps is chosen with a view to the desired accuracy since the error Y(10) - y(10) normally approaches 0 as N increases to Infinity. Were Euler's formula used, the error in its final estimate

Y(10) would normally decline as fast as 1/N ; were Heun's, ... 1/N^2. But the Ricatti differential equation is not normal; no matter how big the number N of steps, those formulas’ estimates Y(10) turn out to be huge positive numbers or overflows instead of -7.53···. Conventional updating formulas do not work here.

The simplest unconventional updating formula Q available turns out to be this rational formula:

Q(θ, t, Y) := Y + (t + θ + Y^2 )·θ/( 1 - θ·Y ) if |θ·Y| < ,

:= ( 1/θ + (t + θ)·θ )/( 1 - θ·Y ) - 1/θ otherwise.

The two algebraically equivalent forms are distinguished to curb rounding errors. Like Heun's, this Q is a second- order formula. ( It can be compounded into a formula of arbitrarily high order by means that lie beyond the scope of these notes.) Iterating it N times with stepsize θ := 10/N yields a final estimate Y(10) in error by roughly

(105/N)^2 even if Division-by-Zero insinuates an Infinity among the iterates Y(n·θ). Disallowing Infinity and

Division-by-Zero would at least somewhat complicate the estimation of y(10) because y(t) has to pass through

Infinity seven times as t increases from 0 to 10. ( See the graph on the next page.)

What becomes complicated is not the program so much as the process of developing and verifying a program that can dispense with Infinity. First, find a very tiny number ε barely small enough that 1 + 10 √ε rounds off to

Next, modify the foregoing rational formula for Q by replacing the divisor ( 1 - θ·Y ) in the “ otherwise ” case by ( ( 1 - θ·Y ) + ε ). Do not omit any of these parentheses; they prevent divisions by zero. Then perform an error-analysis to confirm that iterating this formula produces the same values Y(n·θ) as would be produced without ε except for replacing infinite values Y by huge finite values.

Survival without Infinity is always possible since “ Infinity ” is just a short word for a lengthy explanation. The price paid for survival without Infinity is lengthy cogitation to find a not-too-lengthy substitute, if it exists.

1 2

σ := σ + 0 ; | If σ = 0 , this ensures it is +0 , k := n ; f := 1 ; | FOR j = 1, 2, 3, ..., n IN TURN, | DO { f := ( σ - [ j ] ) - q [ j ]/ f ; | Note: This loop has no explicit k := k - SignBit( f ) } ; | tests nor branches. k(σ) := k ; f (σ) := f. |

( The function SignBit( f ) := 0 if f > 0 or if f is +0 or -0 , := 1 if f < 0 or if f is -0 ; its value at f = -0 may not matter but its value at f = -∞ does. It can be computed by using logical right-shifts or by using ( f < 0.0 ) in C , or 0.5 - SIGN(0.5, f ) in Fortran, or 0.5 - CopySign(0.5, f ) from IEEE 754. However, the use of shifts or CopySign is mandatory on computers that depart from IEEE 754 by flushing UNDERFLOWed subtractions to -0.0 instead of UNDERFLOWing Gradually, q. v. below. Through an unfortunate accident, the arguments of CopySign are reversed on Apple computers, which otherwise conform conscientiously to IEEE 754; they require SignBit( f ) := 0.5 - CopySign( f , 0.5) .)

The function f (σ) is a continued fraction :

The eigenvalues τ[ j ] of T are the zeros of f (σ) , separated by the poles of f (σ) at which it interrupts its monotonic increasing behavior to jump from +∞ to -∞. The integer function k(σ) counts the eigenvalues on either side of σ thus:

τ[1] < τ[2] < ... < τ[k(σ)] ≤ σ < τ[k(σ)+1] < ... < τ[ n ] , and τ[k(σ)] = σ just when f (σ) = 0.

Evidently the eigenvalues τ[ j ] of T are the n values of σ at which k(σ) jumps, and may be located by Binary Search accelerated by interpolative schemes that take values of f (σ) into account too. Although rounding errors can obscure f (σ) severely, its monotonicity and the jumps of k(σ) are practically unaffected, so the eigenvalues are obtained more accurately this way than any other. And quickly too.

If Infinity were outlawed, the loop would have to be encumbered by a test and branch to prevent Division-by- Zero. That test cannot overlap the division, a slow operation, because of sequential dependencies, so the test would definitely slow down the loop even though zero would be detected extremely rarely. The test's impact would be tolerable if the loop contained only one division, but that is not what happens.

Because division is so slow, fast computers pipeline it in such a way that a few divisions can be processed concurrently. To exploit this pipelining, we search for several eigenvalues simultaneously. The variable σ becomes a small array of values, as do f and k , and every go-around the loop issues an array of divisions. In this context the tests, though “ vectorized ” too, degrade speed by 25% or more, much more on machines with multiple pipelines that can subtract and shift concurrently with division, regardless of what else a branch would entail whenever a zero divisor were detected. By dispensing with those tests, this program gains speed and simplicity from Infinity even if Division-by-Zero never happens. End of second example.

a

f ( σ) σ – a [ n ]

b [ n ]

σ – a [ n – 1 ]

b [ n – 1 ]

σ – a [ n – 2 ]

b [ n – 2 ]

σ – a [ n – 3 ]– …

b [ 2 ]

σ – a [ 1 ]

How many such examples are there? Nobody knows how many programs would benefit from Infinity because it remains unsupported by programming language standards, and hence by most compilers, though support in hardware has been abundant for over a decade. To get some idea of the impediment posed by lack of adequate support, try to program each of the foregoing two examples in a way that will compile correctly on every machine whose hardware conforms to IEEE 754. That ordeal will explain why few programmers experiment with Infinity, whence few programs use it.

In my experience with a few compilers that support IEEE 754 on PCs and Macintoshes, Infinity and NaNs confer their greatest benefits by simplifying test programs, which already tend to grossly worse complexity than the software they are designed to validate. Consequently my programs enjoy enhanced reliability because of IEEE 754 regardless of whether it is in force where they run.

...... End of Digression ......

Exception: OVERFLOW.

This happens after an attempt to compute a finite result that would lie beyond the finite range of the floating-point format for which it is destined. The default specified in IEEE 754 is to approximate that result by an appropriately signed Infinity. Since it is approximate, OVERFLOW is also INEXACT. Often that approximation is worthless; it is almost always worthless in matrix computations, and soon turns into NaN or, worse, misleading numbers. Consequently OVERFLOW is often trapped to abort seemingly futile computation sooner rather than later.

Actually, OVERFLOW more often implies that a different computational path should be chosen than that no path leads to the desired goal. For example, if the Fortran expression x / SQRT(xx + yy) encounters OVERFLOW before the SQRT can be computed, it should be replaced by something like (sx) / SQRT( (sx)(sx) + (sy)(s*y) ) with a suitably chosen tiny positive Scale-Factor s. ( A power of 2 avoids roundoff.) The cost of computing and applying s beforehand could be deemed the price paid for insurance against OVERFLOW. Is that price too high?

The biggest finite IEEE 754 Double is almost 1.8 e308 , which is so huge that OVERFLOW occurs extremely rarely if not yet rarely enough to ignore. The cost of defensive tests, branches and scaling to avert OVERFLOW seems too high a price to pay for insurance against an event that hardly ever happens. A lessened average cost will be incurred in most situations by first running without defensive scaling but with a judiciously placed test for OVERFLOW ( and for severe UNDERFLOW ); in the example above the test should just precede the SQRT. Only when necessary need scaling be instituted. Thus our treatment of OVERFLOW has come down to this question: how best can OVERFLOW be detected?

The ideal test for OVERFLOW tests its flag; but that flag cannot be mentioned in most programming languages for lack of a name. Next best are tests for Infinities and NaNs consequent upon OVERFLOW, but prevailing programming languages lack names for them; suitable tests have to be contrived. For example, the C predicate (z != z) is True only when z is NaN and the compiler has not “ optimized ” overzealously; (1.0e37 /(1 + fabs(z)) == 0) is True only when z is infinite; and (z-z != 0) is True only when z is NaN or infinite, the INVALID trap has been disabled, and optimization is not overzealous.

A third way to detect OVERFLOW is to enable its trap and attach a handler to it. Even if a programming language in use provides control structures for this purpose, this approach is beset by hazards. The worst is the possibility that the handler may be entered inadvertently from unanticipated places. Another hazard arises from the concurrent execution of integer and floating-point operations; by the time an OVERFLOW has been detected, data associated with it may have become inaccessible because of changes in pointers and indices. Therefore this approach works only when a copy of the data has been saved to be reprocessed by a different method than the one thwarted by OVERFLOW, and when the scope of the handler has been properly localized; note that the handler must be detached before and reattached after functions that handle their own OVERFLOWs are executed.

Though afflicted by fewer anomalies, Gradual UNDERFLOW is not free of them. For instance, it is possible to have x/y == 0.95 coexist with (xz)/(yz) == 0.5 because xz and probably also yz UNDERFLOWed to Subnormal numbers; without Subnormals the last quotient turns into either an ephemeral 0.0 or a persistent NaN ( INVALID 0/0 ). Thus, UNDERFLOW cannot be ignored entirely whether Gradual or not.

UNDERFLOWs are uncommon. Even if flushed to zero they rarely matter; if handled Gradually they cause harm extremely rarely. That harmful remnant has to be treated much as OVERFLOWs are, with testing and scaling, or trapping, etc. However, the most common treatment is to ignore UNDERFLOW and then to blame whatever harm is caused by doing so upon poor taste in someone else's choice of initial data.

UNDERFLOWs resemble ants; where there is one there are quite likely many more, and they tend to come one after another. That tendency has no direct effect upon the i387-486-Pentium nor M68881/2, but it can severely retard computation on other implementations of IEEE 754 that have to trap to software to UNDERFLOW Gradually for lack of hardware to do it. They take extra time to Denormalize after UNDERFLOW and/or, worse, to prenormalize Subnormals before multiplication or division. ( Gradual UNDERFLOW requires no prenormalization before addition or subtraction of numbers with the same format, but computers usually do it anyway if they have to trap Subnormals.) Worse still is the threat of traps, whether they occur or not, to machines like the DEC Alpha that cannot enable traps without hampering pipelining and/or concurrency; such machines are slowed also by Gradual UNDERFLOWs that do not occur!

Why should we care about such benighted machines? They pose dilemmas for developers of applications software designed to be portable (after recompilation) to those machines as well as ours. To avoid sometimes severe performance degradation by Gradual UNDERFLOW, developers will sometimes resort to simple-minded alternatives. The simplest violates IEEE 754 by flushing every UNDERFLOW to 0.0 , and computers are being sold that flush by default. ( DEC Alpha is a recent example; it is advertised as conforming to IEEE 754 without mention of how slowly it runs with traps enabled for full conformity.) Applications designed with flushing in mind may, when run on ix87s and Macs, have to enable the UNDERFLOW trap and provide a handler that flushes to zero, thereby running slower to get generally worse results! ( This is what MathCAD does on PCs and on Macintoshes.) Few applications are designed with flushing in mind nowadays; since some of these might malfunction if UNDERFLOW were made Gradual instead, disabling the ix87 UNDERFLOW trap to speed them up is not always a good idea.

A format’s usable exponent range is widened by almost its precision N to fully ± 2 K^ as a by-product of Gradual Underflow; were this its sole benefit, its value to formats wider than Single could not justify its price. Compared with Flush-to-Zero, Gradual Underflow taxes performance unless designers expend time and ingenuity or else hardware. Designers incapable of one of those expenditures but willing to cut a corner off IEEE 754 exacerbate market fragmentation, which costs the rest of us cumulatively far more than whatever they saved.

...... Digression on Gradual Underflow ......

To put things in perspective, here is an example of a kind that, when it appears in benchmarks, scares many people into choosing Flush-to-Zero rather than Gradual UNDERFLOW. To simulate the diffusion of heat through a conducting plate with edges held at fixed temperatures, a rectangular mesh is drawn on the plate and temperatures are computed only at intersections. The finer the mesh, the more accurate is the simulation. Time is discretized too; at each timestep, temperature at every interior point is replaced by a positively weighted average of that point's temperature and those of nearest neighbors. Simulation is more accurate for smaller timesteps, which entail larger numbers of timesteps and tinier weights on neighbors; typically these weights are smaller than 1/8 , and timesteps number in the thousands.

When edge temperatures are mostly fixed at 0 , and when interior temperatures are mostly initialized to 0 , then at every timestep those nonzero temperatures next to zeros get multiplied by tiny weights as they diffuse to their neighbors. With fine meshes, large numbers of timesteps can pass before nonzero temperatures have diffused almost everywhere, and then tiny weights can get raised to large powers, so UNDERFLOWs are numerous. If UNDERFLOW is Gradual, denormalization will produce numerous subnormal numbers; they slow computation badly on a computer designed to handle subnormals slowly because the designer thought they would be rare. Flushing UNDERFLOW to zero does not slow computation on such a machine; zeros created that way may speed it up.

When this simulation figures in benchmarks that test computers' speeds, the temptation to turn slow Gradual UNDERFLOW Off and fast Flush-to-Zero On is more than a marketing manager can resist. Compiler vendors succumb to the same temptation; they make Flush-to-Zero their default. Such practices bring to mind calamitous explosions that afflicted high-pressure steam boilers a century or so ago because attendants tied down noisy over- pressure relief valves the better to sleep undisturbed. +---------------------------------------------------------------------+ | Vast numbers of UNDERFLOWs usually signify that something about | | a program or its data is strange if not wrong; this signal should | | not be ignored, much less squelched by flushing UNDERFLOW to 0.| +---------------------------------------------------------------------+ What is strange about the foregoing simulation is that exactly zero temperatures occur rarely in Nature, mainly at the boundary between colder ice and warmer water. Initially increasing all temperatures by some negligible

amount, say 10-30^ , would not alter their physical significance but it would eliminate practically all UNDERFLOWs and so render their treatment, Gradual or Flush-to-Zero, irrelevant.

To use such atypical zero data in a benchmark is justified only if it is intended to expose how long some hardware takes to handle UNDERFLOW and subnormal numbers. Unlike many other floating-point engines, the i387 and its successors are slowed very little by subnormal numbers; we should thank Intel's engineers for that and enjoy it rather than resort to an inferior scheme which also runs slower on the i387-etc.

...... End of Digression ......

When UNDERFLOW's trap is enabled, the IEEE 754 default Gradual Underflow does not occur; the result's exponent is “ wrapped ” instead, which means in this case that the delivered result has an exponent too big by an

amount 2K-1·3 that depends upon its format:

Double-Extended ... too big by 24576 ; 224576 = 1.3 E 7398 Double ... too big by 1536 ; 21536 = 2.4 E 462 Single ... too big by 192 ; 2192 = 6.3 E 57 ( Though required by IEEE 754, the latter two wraps are not performed by ix87 nor 680x nor some other machines without help from suitable trap-handling software. )

In effect, the delivered result has been multiplied by a power of 2 so huge as to turn what would have underflowed into a relatively big but predictable quantity that a trap-handler can reinterpret. If there is no trap handler, computation will proceed with that bigger quantity or, in the case of format-converting FSTore instructions, without storing anything.

Exponent wrapping provides the fastest and most accurate way to compute extended products and quotients like

Q

( a 1 + b 1 ) ⋅ ( a 2 + b 2 ) ⋅ ( a 3 + b 3 ) ⋅ ( …) ⋅( a N + b N )

( c 1 + d 1 ) ⋅ ( c 2 + d 2 ) ⋅ ( c 3 + d 3 ) ⋅ ( …) ⋅( c M + d M )

The “ Directed ” roundings can be used to implement Interval Arithmetic, which is a scheme that approximate every variable not by one value of unknown reliability but by two that are guaranteed to straddle the ideal value. This scheme is not so popular in the U.S.A. as it is in parts of Europe, where some people distrust computers.

Control-Word control of rounding modes allows software modules to be re-run in different rounding modes without recompilation. This cannot be done with some other computers, notably DEC Alpha, that can set two bits in every instruction to control rounding direction at compile-time; that is a mistake. It is worsened by the designers' decision to take rounding direction from a Control-Word when the two bits are set to what would otherwise have been one of the directed roundings; had Alpha obtained only the round-to-nearest mode from the Control-Word, their mistake could have been transformed into an advantageous feature.

All these rounding modes round to a value drawn from the set of values representable either with the precision of the destination or selected by rounding precision control to be described below. The sets of representable values were spelled out in tables above. The direction of rounding can also affect OVER/UNDERFLOW ; a positive quantity that would OVERFLOW to +∞ in the default mode will turn into the format's biggest finite floating- point number when rounded towards -∞. And the expression “ X - X ” delivers +0.0 for every finite X in all rounding modes except for rounding directed towards -∞, for which -0.0 is delivered instead. These details are designed to make Interval Arithmetic work better.

Ideally, software that performs Binary-Decimal conversion ( both ways ) should respect the requested direction of rounding. David Gay of AT&T Bell Labs has released algorithms that do so into the public domain ( Netlib ); to use less accurate methods now is a blunder.

Precisions of Rounding:

IEEE 754 obliges only machines that compute in the Extended ( long double or REAL10 ) format to let programmers control the precision of rounding from a Control-Word. This lets ix87 or M680x0 emulate the roundoff characteristics of other machines that conform to IEEE 754 but support only Single ( C's float, or REAL4 ) and Double ( C's double, or REAL*8 ), not Extended. Software developed and checked out on one of those machines can be recompiled for a 680x0 or ix87 and, if anomalies excite concerns about differences in roundoff, the software can be run very nearly as if on its original host without sacrificing speed on the 680x or ix87. Conversely, software developed on these machines but without explicit mention of Extended can be rerun in a way that portends what it will do on machines that lack Extended. Precision Control rounds to 24 sig. bits to emulate Single, to 53 sig. bits to emulate Double, leaving zeros in the rest of the 64 sig. bits of the Extended format.

The emulation is imperfect. Transcendental functions are unlikely to match. Although Decimal -> Binary conversion must round to whatever precision is set by the Control-Word, Binary -> Decimal should ideally be unaffected since its precision is determined solely by the destination’s format, but ideals are not always attained. Some OVER/UNDERFLOWs that would occur on those other machines need not occur on the ix87 ; IEEE 754 allows this, perhaps unwisely, to relieve hardware implementors of details formerly thought unimportant.

Few compilers expose the Control-Word to programmers. Worse, some compilers have revived a nasty bug that emerged when Double-Precision first appeared among Fortran compilers; it goes like this: Consider S = X T = ( S - Y )/(... ) in a Fortran program where S is SINGLE PRECISION, and X and Y are DOUBLE or EXTENDED PRECISION variables or expressions computed in registers. Compilers that supplant S by X in the second statement save the time required to reload S from memory but spoil T. Though S and X differ by merely a rounding error, the difference matters.

The Baleful Influence of Benchmarks:

Hardware and compilers are increasingly being rated entirely according to their performance in benchmarks that measure only speed. That is a mistake committed because speed is so much easier to measure than other qualities like reliability and convenience. Sacrificing them in order to run faster will compel us to run longer. By disregarding worthwhile qualities other than speed, current benchmarks penalize conscientious adherence to standards like IEEE 754; worse, attempts to take those qualities into account are thwarted by political constraints imposed upon programs that might otherwise qualify as benchmarks.

For example, a benchmark should compile and run on every commercially significant computer system. This rules out our programs for solving the differential equation and the eigenvalue problem described above under the Digression on Division-by-Zero. To qualify as benchmarks, programs must prevent exceptional events that might stop or badly slow some computers even if such prevention retards performance on computers that, by conforming conscientiously to IEEE 754, would not stop.

The Digression on Gradual Underflow offered an example of a benchmark that lent credibility to a misguided preference for Flush-to-Zero, in so far as it runs faster than Gradual Underflow on some computers, by disregarding accuracy. If Gradual Underflow's superior accuracy has no physical significance there, neither has the benchmark's data.

Accuracy poses tricky questions for benchmarks. One hazard is the ...

Stopped Clock Paradox: Why is a mechanical clock more accurate stopped than running?

A running clock is almost never exactly right, whereas a stopped clock is exactly right twice a day. ( But WHEN is it right? Alas, that was not the question .)

The computational version of this paradox is a benchmark that penalizes superior computers, that produce merely excellent approximate answers, by making them seem less accurate than an inferior computer that gets exactly the right answer for the benchmark's problem accidentally. Other hazards exist too; some will be illustrated by the next example.

Quadratic equations like

p x 2 - 2 q x + r = 0 arise often enough to justify tendering a program that solves it to serve as a benchmark. When the equation's roots x1 and x2 are known in advance both to be real, the simplest such program is the procedure Qdrtc exhibited on the next page.

In the absence of premature Over/Underflow, Qdrtc computes x1 and x2 at least about as accurately as they are determined by data { p, q, r } uncorrelatedly uncertain in their last digits stored. It should be tested first on trivial data to confirm that it has not been corrupted by a misprint nor by an ostensible correction like “ x1 := (q+s)/p ; x2 := (q-s)/p ” copied naively from some elementary programming text. Here are some trivial data:

{ p = Any nonzero, q = r = 0 }; x1 = x2 = 0.

{ p = 2.0 , q = 5.0 , r = 12.0 }; x1 = 2.0 , x2 = 3..

{ p = 2.0 E-37, q = 1.0 , r = 2.0 }; x1 ≈ 1.0 , x2 ≈ 1.0 E 37.

Swapping p with q swaps { x1, x2 } with { 1/x2, 1/x1 }.

{μp, μq, μ*r} yields {x1, x2} independently of nonzero μ.

IEEE Standard 754 for Binary Floating-Point Arithmetic - Lecture Notes | CS 1541, Study notes of Computer Science

Related documents

Partial preview of the text

Download IEEE Standard 754 for Binary Floating-Point Arithmetic - Lecture Notes | CS 1541 and more Study notes Computer Science in PDF only on Docsity!

Lecture Notes on the Status of

IEEE Standard 754 for Binary Floating-Point Arithmetic

Introduction:

Representable Numbers:

-+- Consecutive Positive Floating-Point Numbers -+-

Span and Precision of IEEE 754 Floating-Point Formats :

Format Min. Subnormal Min. Normal Max. Finite 2 -N^ Sig. Dec.

Multiply-Accumulate , a Mixed Blessing:

Exception: INVALID operation.

Exception: DIVIDE by ZERO.

...... Digression on Division-by-Zero ......

no divisions in it. The problem is to compute y(10) where y(t) satisfies the Ricatti equation

dy/dt = t + y^2 for all t ≥ 0 , y(0) = 0.

Let us pretend not to know that y(t) may be expressed in terms of Bessel functions J... , whence

Q(θ, t, Y) will stand for an Updating Formula that advances from any estimate Y ≈ y(t) to a later estimate

Q(θ, t, Y) ≈ y(t+θ). Vastly many updating formulas exist; the simplest that might be applied to solve the given

Q(θ, t, Y) := Y + θ·(t + Y^2 ).

f := t + Y^2 ; q := Y + θ·f ;

Q(θ, t, Y) := Y + ( f + t+θ + q 2 )·θ/.

Q(θ, t, Y) := Y + (t + θ + Y^2 )·θ/( 1 - θ·Y ) if |θ·Y| < ,

:= ( 1/θ + (t + θ)·θ )/( 1 - θ·Y ) - 1/θ otherwise.

Division-by-Zero would at least somewhat complicate the estimation of y(10) because y(t) has to pass through

Infinity seven times as t increases from 0 to 10. ( See the graph on the next page.)

a

f ( σ) σ – a [ n ]

b [ n ]

σ – a [ n – 1 ]

b [ n – 1 ]

σ – a [ n – 2 ]

b [ n – 2 ]

σ – a [ n – 3 ]– …

b [ 2 ]

σ – a [ 1 ]

Exception: OVERFLOW.

...... Digression on Gradual Underflow ......

Q

( a 1 + b 1 ) ⋅ ( a 2 + b 2 ) ⋅ ( a 3 + b 3 ) ⋅ ( …) ⋅( a N + b N )

( c 1 + d 1 ) ⋅ ( c 2 + d 2 ) ⋅ ( c 3 + d 3 ) ⋅ ( …) ⋅( c M + d M )

Precisions of Rounding:

The Baleful Influence of Benchmarks:

Stopped Clock Paradox: Why is a mechanical clock more accurate stopped than running?