Rounding Error Analysis in Floating Point Arithmetic - Prof. Zhaojun Bai, Study notes of Computer Science

These notes analyze rounding errors in floating point arithmetic, focusing on the behavior of addition, subtraction, multiplication, and division. They discuss catastrophic cancellation and how it degrades computed sums and differences of nearly equal numbers, and they introduce forward and backward error analysis as methods to understand and quantify errors in floating point computations.

Uploaded on 07/31/2009

ECS231 Handout 3 Rounding Error Analysis April 2, 2009


  1. Let $\hat{x}$ and $\hat{y}$ be floating point numbers such that

$$\hat{x} = x(1+\tau_1) \quad\text{and}\quad \hat{y} = y(1+\tau_2), \qquad |\tau_i| \le \tau \ll 1,$$

where the $\tau_i$ could be the relative errors incurred in the process of “collecting/getting” the data from the original source or in previous operations.

Question: how do the four basic arithmetic operations behave?
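To make the input errors $\tau_i$ concrete, here is a minimal sketch that measures the relative error introduced when an exact rational value is stored as a double. It assumes IEEE double precision, where $\frac{1}{2}\epsilon = 2^{-53}$; the helper name `input_relative_error` is hypothetical and not part of the handout.

```python
from fractions import Fraction

def input_relative_error(exact: Fraction) -> Fraction:
    """Return tau such that float(exact) == exact * (1 + tau), computed exactly."""
    stored = Fraction(float(exact))  # exact rational value of the stored double
    return (stored - exact) / exact

if __name__ == "__main__":
    for q in (Fraction(1, 10), Fraction(1, 3), Fraction(1, 2)):
        tau = input_relative_error(q)
        print(f"x = {q}: tau = {float(tau):.3e}")
```

Values such as 1/2 that are exactly representable give $\tau = 0$, while 1/10 and 1/3 pick up a small nonzero $\tau$ before any arithmetic is performed.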

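  2. Addition and subtraction

$$
\begin{aligned}
\mathrm{fl}(\hat{x}+\hat{y}) &= (\hat{x}+\hat{y})(1+\delta), \qquad |\delta| \le \tfrac{1}{2}\epsilon \\
&= x(1+\tau_1)(1+\delta) + y(1+\tau_2)(1+\delta) \\
&= x + y + x(\tau_1+\delta+O(\tau\epsilon)) + y(\tau_2+\delta+O(\tau\epsilon)) \\
&= (x+y)\left(1 + \frac{x}{x+y}(\tau_1+\delta+O(\tau\epsilon)) + \frac{y}{x+y}(\tau_2+\delta+O(\tau\epsilon))\right) \\
&\equiv (x+y)(1+\hat{\delta}),
\end{aligned}
$$

where $\hat{\delta}$ can be bounded as follows:

$$|\hat{\delta}| \le \frac{|x|+|y|}{|x+y|}\left(\tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon)\right).$$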

Three possible cases:

  • If $x$ and $y$ have the same sign, i.e., $xy > 0$, then $|x+y| = |x|+|y|$; this implies

$$|\hat{\delta}| \le \tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon) \ll 1.$$

Thus $\mathrm{fl}(\hat{x}+\hat{y})$ approximates $x+y$ well.

  • If $x \approx -y$, so that $|x+y| \approx 0$, then $(|x|+|y|)/|x+y| \gg 1$; this implies that $|\hat{\delta}|$ could be close to or much bigger than 1. Thus $\mathrm{fl}(\hat{x}+\hat{y})$ may turn out to have nothing to do with the true $x+y$. This is the so-called catastrophic cancellation, which happens when a floating point number is subtracted from another nearly equal floating point number. Cancellation magnifies the relative errors or uncertainties already present in $\hat{x}$ and $\hat{y}$.

  • In general, if $(|x|+|y|)/|x+y|$ is not too big, $\mathrm{fl}(\hat{x}+\hat{y})$ provides a good approximation to $x+y$.
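The three cases can be checked numerically. The sketch below uses exact rational arithmetic to measure $|\hat{\delta}|$ and the amplification factor $(|x|+|y|)/|x+y|$ for one same-sign pair and one nearly cancelling pair. It assumes IEEE double precision ($\frac{1}{2}\epsilon = 2^{-53}$); the function name `analyze_sum` and the test values are illustrative choices, not taken from the handout.

```python
from fractions import Fraction

U = Fraction(1, 2**53)  # assumed: epsilon/2 for IEEE double precision

def analyze_sum(x: Fraction, y: Fraction):
    """Return (|delta_hat|, amplification factor, bound) for fl(x_hat + y_hat)."""
    x_hat, y_hat = float(x), float(y)        # rounding the data introduces tau_1, tau_2
    tau = max(abs((Fraction(x_hat) - x) / x),
              abs((Fraction(y_hat) - y) / y))
    computed = Fraction(x_hat + y_hat)       # exact value of the rounded sum
    delta_hat = abs((computed - (x + y)) / (x + y))
    kappa = (abs(x) + abs(y)) / abs(x + y)   # (|x| + |y|) / |x + y|
    bound = kappa * (tau + U + tau * U)      # bound with the O(tau*eps) term made explicit
    return delta_hat, kappa, bound

if __name__ == "__main__":
    cases = {
        "same sign": (Fraction(1, 3), Fraction(1, 7)),
        "x close to -y": (Fraction(1, 3), -Fraction(1, 3) - Fraction(1, 10**15)),
    }
    for name, (x, y) in cases.items():
        d, k, b = analyze_sum(x, y)
        print(f"{name:14s} factor = {float(k):.3e}  "
              f"|delta_hat| = {float(d):.3e}  bound = {float(b):.3e}")
```

For the same-sign pair the factor is exactly 1 and the error stays at rounding level; for the nearly cancelling pair the factor is of order $10^{14}$ and the relative error of the computed sum grows by many orders of magnitude, even though each input was rounded correctly.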

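  3. Multiplication and division are very well-behaved.

$$\mathrm{fl}(\hat{x}\times\hat{y}) = (\hat{x}\times\hat{y})(1+\delta) = xy(1+\tau_1)(1+\tau_2)(1+\delta) \equiv xy(1+\hat{\delta}_\times),$$

$$\mathrm{fl}(\hat{x}/\hat{y}) = (\hat{x}/\hat{y})(1+\delta) = (x/y)(1+\tau_1)(1+\tau_2)^{-1}(1+\delta) \equiv (x/y)(1+\hat{\delta}_\div),$$

where

$$\hat{\delta}_\times = \tau_1+\tau_2+\delta+O(\tau\epsilon), \qquad \hat{\delta}_\div = \tau_1-\tau_2+\delta+O(\tau\epsilon).$$

Thus $|\hat{\delta}_\times| \le 2\tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon)$ and $|\hat{\delta}_\div| \le 2\tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon)$.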
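The bounds for multiplication and division can be verified the same way. This is a minimal sketch under the assumption of IEEE double precision ($\frac{1}{2}\epsilon = 2^{-53}$); the helper names `rel` and `check_mul_div` are hypothetical, and the bounds are written out exactly, so the $O(\tau\epsilon)$ terms are included rather than dropped.

```python
from fractions import Fraction

U = Fraction(1, 2**53)  # assumed: epsilon/2 for IEEE double precision

def rel(computed: float, exact: Fraction) -> Fraction:
    """Absolute value of the relative error of `computed` against `exact`."""
    return abs((Fraction(computed) - exact) / exact)

def check_mul_div(x: Fraction, y: Fraction):
    """Return (|delta_mul|, |delta_div|, bound_mul, bound_div) for rounded inputs."""
    x_hat, y_hat = float(x), float(y)
    tau = max(rel(x_hat, x), rel(y_hat, y))
    d_mul = rel(x_hat * y_hat, x * y)
    d_div = rel(x_hat / y_hat, x / y)
    bound_mul = (1 + tau) ** 2 * (1 + U) - 1          # = 2*tau + eps/2 + higher-order terms
    bound_div = (1 + tau) / (1 - tau) * (1 + U) - 1   # = 2*tau + eps/2 + higher-order terms
    return d_mul, d_div, bound_mul, bound_div

if __name__ == "__main__":
    # The same nearly cancelling pair that is troublesome for addition.
    x, y = Fraction(1, 3), -Fraction(1, 3) - Fraction(1, 10**15)
    d_mul, d_div, b_mul, b_div = check_mul_div(x, y)
    print(f"|delta_mul| = {float(d_mul):.3e} <= {float(b_mul):.3e}")
    print(f"|delta_div| = {float(d_div):.3e} <= {float(b_div):.3e}")
```

Unlike addition, there is no amplification factor here: the relative errors stay within a few multiples of $2^{-53}$ regardless of how close $x$ is to $-y$.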

  4. Examples of catastrophic cancellation

Example 1. Computing $\sqrt{n+1}-\sqrt{n}$ straightforwardly causes substantial loss of significant digits for large $n$:

n          fl(√(n+1))                fl(√n)                    fl(fl(√(n+1)) − fl(√n))
1.00e+10   1.00000000004999994e+05   1.00000000000000000e+05   4.99999441672116518e-
1.00e+11   3.16227766018419061e+05   3.16227766016837908e+05   1.58115290105342865e-
1.00e+12   1.00000000000050000e+06   1.00000000000000000e+06   5.00003807246685028e-
1.00e+13   3.16227766016853740e+06   3.16227766016837955e+06   1.57859176397323608e-
1.00e+14   1.00000000000000503e+07   1.00000000000000000e+07   5.02914190292358398e-
1.00e+15   3.16227766016838104e+07   3.16227766016837917e+07   1.86264514923095703e-
1.00e+16   1.00000000000000000e+08   1.00000000000000000e+08   0.00000000000000000e+

Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated. In the present case, one can compute $\sqrt{n+1}-\sqrt{n}$ almost to full precision by using the equality

$$\sqrt{n+1}-\sqrt{n} = \frac{1}{\sqrt{n+1}+\sqrt{n}}.$$

Consequently, the computed results are

n          fl(1/(√(n+1) + √n))
1.00e+10   4.999999999875000e-
1.00e+11   1.581138830080237e-
1.00e+12   4.999999999998749e-
1.00e+13   1.581138830084150e-
1.00e+14   4.999999999999987e-
1.00e+15   1.581138830084189e-
1.00e+16   5.000000000000000e-

In fact, one can show that $\mathrm{fl}(1/(\sqrt{n+1}+\sqrt{n})) = (\sqrt{n+1}-\sqrt{n})(1+\delta)$, where $|\delta| \le 5\epsilon + O(\epsilon^2)$ (try it!).
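The comparison in Example 1 can be reproduced with a short script. This is a sketch rather than the handout's original program: it assumes IEEE double precision, uses Python's `decimal` module at 50 digits as a high-precision reference, and the function names are hypothetical.

```python
import math
from decimal import Decimal, localcontext

def diff_naive(n: float) -> float:
    """sqrt(n+1) - sqrt(n), evaluated directly (subject to cancellation)."""
    return math.sqrt(n + 1.0) - math.sqrt(n)

def diff_stable(n: float) -> float:
    """The same quantity, rewritten as 1 / (sqrt(n+1) + sqrt(n))."""
    return 1.0 / (math.sqrt(n + 1.0) + math.sqrt(n))

def diff_reference(n: float) -> float:
    """Reference value computed with 50 significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = 50
        d = Decimal(n)
        return float((d + 1).sqrt() - d.sqrt())

def rel_error(value: float, reference: float) -> float:
    return abs(value - reference) / abs(reference)

if __name__ == "__main__":
    print(f"{'n':>8}  {'naive':>24}  {'stable':>24}  {'err naive':>10}  {'err stable':>10}")
    for k in range(10, 17):
        n = float(10**k)
        ref = diff_reference(n)
        naive, stable = diff_naive(n), diff_stable(n)
        print(f"{n:8.2e}  {naive:24.17e}  {stable:24.17e}  "
              f"{rel_error(naive, ref):10.2e}  {rel_error(stable, ref):10.2e}")
```

The relative error of the direct formula grows with $n$ until the result collapses to zero at $n = 10^{16}$, while the reformulated expression stays accurate to within a few units of rounding error.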

Example 2. Consider the function

$$f(x) = \frac{1-\cos x}{x^2} = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2.$$

Note that $0 \le f(x) < 1/2$ for all $x \ne 0$.

Compare the computed values for $x = 1.2\times 10^{-5}$ using the above two expressions (assume that the value of $\cos x$ is rounded to 10 significant figures).
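One way to carry out the comparison in Example 2 is sketched below. The helper `round_sig` simulates rounding to 10 significant figures through string formatting, and the second expression is evaluated with $\sin(x/2)$ rounded to the same number of figures; both choices are assumptions made for illustration, since the handout only specifies the rounding of $\cos x$.

```python
import math

def round_sig(value: float, digits: int = 10) -> float:
    """Round `value` to `digits` significant decimal digits via string formatting."""
    return float(f"{value:.{digits - 1}e}")

def f_cos(x: float, digits: int = 10) -> float:
    """(1 - cos x) / x^2 with cos x rounded to `digits` significant figures."""
    c = round_sig(math.cos(x), digits)
    return (1.0 - c) / (x * x)

def f_sin(x: float, digits: int = 10) -> float:
    """0.5 * (sin(x/2) / (x/2))^2 with sin(x/2) rounded to `digits` significant figures."""
    s = round_sig(math.sin(x / 2.0), digits)
    return 0.5 * (s / (x / 2.0)) ** 2

if __name__ == "__main__":
    x = 1.2e-5
    print(f"(1 - cos x)/x^2 form : {f_cos(x):.10f}")
    print(f"sin(x/2) form        : {f_sin(x):.10f}")
```

With these assumptions the first form returns a value near 0.694, which violates the bound $f(x) < 1/2$, because subtracting the rounded $\cos x$ from 1 leaves only one significant digit. The second form returns a value that agrees with $1/2$ to about 10 digits.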

  5. Forward and backward error analysis

We illustrate the basic idea through a simple example. Consider the computation of an inner product of two vectors $x, y \in \mathbb{R}^3$,

$$x^T y \overset{\mathrm{def}}{=} x_1y_1 + x_2y_2 + x_3y_3,$$

assuming the $x_i$'s and $y_j$'s are already floating point numbers. It is likely that $\mathrm{fl}(x\cdot y)$ is computed in the following order:

$$\mathrm{fl}(x^Ty) = \mathrm{fl}(\mathrm{fl}(\mathrm{fl}(x_1y_1)+\mathrm{fl}(x_2y_2)) + \mathrm{fl}(x_3y_3)).$$
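The evaluation order above involves five rounded operations: three products and two sums. As a rough illustration, the sketch below performs them in that order and uses exact rational arithmetic to recover the relative error $\delta$ of each one, in the model $\mathrm{fl}(a \circ b) = (a \circ b)(1+\delta)$. It assumes IEEE double precision ($\frac{1}{2}\epsilon = 2^{-53}$), no underflow, and nonzero intermediate results; the function names and test vectors are hypothetical.

```python
from fractions import Fraction

U = Fraction(1, 2**53)  # assumed: epsilon/2 for IEEE double precision

def rel_err(computed: float, exact: Fraction) -> Fraction:
    """delta such that computed == exact * (1 + delta); exact must be nonzero."""
    return (Fraction(computed) - exact) / exact

def inner3(x, y):
    """Return fl(x^T y), evaluated as fl(fl(fl(x1*y1) + fl(x2*y2)) + fl(x3*y3)),
    together with the relative error of each of the five rounded operations."""
    p1, p2, p3 = x[0] * y[0], x[1] * y[1], x[2] * y[2]
    s1 = p1 + p2
    s2 = s1 + p3
    deltas = [
        rel_err(p1, Fraction(x[0]) * Fraction(y[0])),
        rel_err(p2, Fraction(x[1]) * Fraction(y[1])),
        rel_err(p3, Fraction(x[2]) * Fraction(y[2])),
        rel_err(s1, Fraction(p1) + Fraction(p2)),
        rel_err(s2, Fraction(s1) + Fraction(p3)),
    ]
    return s2, deltas

if __name__ == "__main__":
    x, y = [0.1, 0.2, 0.3], [0.7, -0.5, 0.9]
    s, deltas = inner3(x, y)
    exact = sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))
    print(f"fl(x^T y) = {s!r}")
    for i, d in enumerate(deltas, start=1):
        print(f"delta_{i} = {float(d):+.3e}  (|delta| <= eps/2: {abs(d) <= U})")
    print(f"forward error = {float(abs(Fraction(s) - exact)):.3e}")
```

Each individual $\delta$ is bounded by $\frac{1}{2}\epsilon$; how these five small errors combine into the error of the final result is the kind of question the analysis in this section addresses.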