Lecture Slides on Floating Point - Computer Systems and Program | CS 367, Study notes of Computer Science

Material Type: Notes; Professor: Carver; Class: Computer Systems and Programm; Subject: Computer Science; University: George Mason University; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 02/12/2009

Floating Point
Topics
IEEE Floating Point Standard
Rounding
Floating Point Operations
Mathematical properties
CS 367

Floating Point Puzzles

For each of the following C expressions, either:
• Argue that it is true for all argument values
• Explain why not true

Assume the following declarations, and that neither d nor f is NaN:

int x = …;
float f = …;
double d = …;

• x == (int)(float) x
• x == (int)(double) x
• f == (float)(double) f
• d == (float) d
• f == -(-f);
• 2/3 == 2/3.0
• d < 0.0 ⇒ ((d*2) < 0.0)
• d > f ⇒ -f > -d
• d * d >= 0.0
• (d+f)-d == f
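
A few of the puzzles can be checked by experiment. The sketch below is not from the slides: the helper names are made up for illustration, and it assumes a 32-bit int with IEEE 754 float and double. The volatile temporaries are there only to force each intermediate value to be rounded to its declared type.

```c
/* Each helper returns 1 if the puzzle expression holds for the given input.
   Helper names are hypothetical; assumes 32-bit int and IEEE 754 types. */

/* x == (int)(float) x : fails once x needs more than 24 significant bits.
   Avoid x near INT_MAX, where the cast back to int is out of range. */
int int_float_roundtrip(int x) {
    volatile float f = (float)x;
    return x == (int)f;
}

/* x == (int)(double) x : holds, since double has 53 significant bits. */
int int_double_roundtrip(int x) {
    volatile double d = (double)x;
    return x == (int)d;
}

/* d == (float) d : fails when d is not exactly representable as a float. */
int double_float_roundtrip(double d) {
    volatile float f = (float)d;
    return d == f;
}

/* (d+f)-d == f : fails when f is absorbed by rounding in d+f. */
int add_then_subtract(double d, float f) {
    volatile double sum = d + f;
    return (sum - d) == f;
}

/* 2/3 == 2/3.0 : integer division gives 0, floating division does not. */
int two_thirds(void) {
    return 2/3 == 2/3.0;
}
```

For example, int_float_roundtrip(16777217) returns 0 because 2^24 + 1 does not fit in a 24-bit significand, and add_then_subtract(1e20, 1.0f) returns 0 because the 1.0 is lost when added to 1e20.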

IEEE Floating Point

IEEE Standard 754

• Established in 1985 as uniform standard for floating point arithmetic
• Before that, many idiosyncratic formats
• Supported by all major CPUs

Driven by Numerical Concerns

• Nice standards for rounding, overflow, underflow
• Hard to make go fast
  - Numerical analysts predominated over hardware types in defining standard

Fractional Binary Numbers

Representation

• Bits to right of “binary point” represent fractional powers of 2
• Represents rational number:

$$b_i\, b_{i-1} \cdots b_2\, b_1\, b_0 \,.\, b_{-1}\, b_{-2}\, b_{-3} \cdots b_{-j} \;=\; \sum_{k=-j}^{i} b_k \cdot 2^{k}$$
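
The positional sum for fractional binary numbers can be sketched as a small C routine. This is an illustration only: the function name is made up, and it assumes the input contains only '0', '1', and at most one '.', with no validation.

```c
#include <stddef.h>

/* Hypothetical helper: evaluate a binary string such as "101.11"
   as the sum over k of b_k * 2^k. Input is not validated. */
double frac_binary_value(const char *bits) {
    double value = 0.0;
    double weight = 0.5;      /* weight of the first bit right of the point */
    int seen_point = 0;

    for (size_t n = 0; bits[n] != '\0'; n++) {
        char c = bits[n];
        if (c == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            value = value * 2.0 + (c - '0');   /* integer part: shift left, add bit */
        } else {
            value += (c - '0') * weight;       /* fractional part: 2^-1, 2^-2, ... */
            weight /= 2.0;
        }
    }
    return value;
}
```

For example, frac_binary_value("101.11") evaluates 4 + 1 + 1/2 + 1/4 = 5.75.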

Floating Point Representation

Numerical Form

$$(-1)^{s} \, M \, 2^{E}$$

• Sign bit s determines whether number is negative or positive
• Significand M normally a fractional value in range [1.0,2.0).
• Exponent E weights value by power of two

Encoding

• MSB is sign bit
• exp field encodes E
• frac field encodes M

Field order, from most significant bit: s, exp, frac

Floating Point Precisions

Encoding

• MSB is sign bit
• exp field encodes E
• frac field encodes M

Field order, from most significant bit: s, exp, frac

Sizes

• Single precision: 8 exp bits, 23 frac bits
  - 32 bits total
• Double precision: 11 exp bits, 52 frac bits
  - 64 bits total
• Extended precision: 15 exp bits, 63 frac bits
  - Only found in Intel-compatible machines
  - Stored in 80 bits
    » 1 bit wasted
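
The single-precision layout (1 sign bit, 8 exp bits, 23 frac bits) can be inspected from C. A minimal sketch, assuming float is a 32-bit IEEE 754 single; the struct and function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a single-precision value. */
struct fp_fields {
    uint32_t s;      /* 1 bit   */
    uint32_t exp;    /* 8 bits  */
    uint32_t frac;   /* 23 bits */
};

/* Copy the float's bytes into an integer, then mask out each field. */
struct fp_fields split_float(float f) {
    uint32_t bits;
    struct fp_fields out;
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to reinterpret bits */
    out.s    = bits >> 31;
    out.exp  = (bits >> 23) & 0xFFu;
    out.frac = bits & 0x7FFFFFu;
    return out;
}
```

For example, split_float(1.0f) gives s = 0, exp = 127, frac = 0.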

“Normalized” Numeric Values

Condition

• exp ≠ 000…0 and exp ≠ 111…1

Exponent coded as biased value

• E = Exp – Bias
  - Exp: unsigned value denoted by exp
  - Bias: Bias value
    » Single precision: 127 (Exp: 1…254, E: -126…127)
    » Double precision: 1023 (Exp: 1…2046, E: -1022…1023)
    » in general: $\text{Bias} = 2^{e-1} - 1$, where e is number of exponent bits

Significand coded with implied leading 1

• $M = 1.xxx \ldots x_2$
  - xxx…x: bits of frac
  - Minimum when 000…0 (M = 1.0)
  - Maximum when 111…1 (M = 2.0 – ε)
  - Get extra leading bit for “free”
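
The normalized rules (E = Exp – Bias, M = 1 + frac scaled by 2^-23) can be applied by hand to a single-precision bit pattern. The sketch below is illustrative: the function name is made up, it handles only the normalized case, and returning 0.0 for other patterns is an arbitrary choice made for the sketch.

```c
#include <stdint.h>

/* Decode a normalized single-precision bit pattern into a double.
   Patterns with exp all 0s or all 1s are out of scope and yield 0.0. */
double normalized_value(uint32_t bits) {
    uint32_t s    = bits >> 31;
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    if (exp == 0 || exp == 0xFFu)
        return 0.0;                               /* not a normalized value */

    int E = (int)exp - 127;                       /* E = Exp - Bias */
    double M = 1.0 + (double)frac / 8388608.0;    /* implied leading 1; 8388608 = 2^23 */

    double scale = 1.0;                           /* scale = 2^E, built by repeated doubling/halving */
    for (int i = 0; i < E; i++) scale *= 2.0;
    for (int i = 0; i > E; i--) scale /= 2.0;

    return s ? -(M * scale) : M * scale;
}
```

For example, the pattern 0x3FC00000 has exp = 127 and frac = 100…0, so E = 0, M = 1.5, and the value is 1.5.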

Normalized Encoding Example

Value

• float F = 15213.0;
• $15213_{10} = 11101101101101_2 = 1.1101101101101_2 \times 2^{13}$

Significand

• $M = 1.1101101101101_2$
• $\text{frac} = 11011011011010000000000_2$

Exponent

• E = 13
• Bias = 127
• $\text{Exp} = 140 = 10001100_2$

Floating Point Representation (Class 02):

• Hex: 4 6 6 D B 4 0 0
• Binary: 0100 0110 0110 1101 1011 0100 0000 0000
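
The worked example can be reproduced by packing the three fields and comparing against what the compiler stores for 15213.0f. A sketch under the assumption that float is a 32-bit IEEE 754 single; both function names are made up, and encode_normalized does no range checking on E.

```c
#include <stdint.h>
#include <string.h>

/* Pack sign, unbiased exponent E, and 23-bit frac into a single-precision
   bit pattern. Valid only for normalized values (-126 <= E <= 127). */
uint32_t encode_normalized(uint32_t s, int E, uint32_t frac) {
    uint32_t exp = (uint32_t)(E + 127);           /* Exp = E + Bias */
    return (s << 31) | (exp << 23) | (frac & 0x7FFFFFu);
}

/* Return the raw bit pattern of a float. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With s = 0, E = 13, and frac = 0x6DB400 (the 23 frac bits above written in hex), encode_normalized returns 0x466DB400, the same pattern as float_bits(15213.0f).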

Summary of Floating Point Real Number Encodings

[Figure: number line of encodings, left to right: NaN, -Normalized, -Denorm, +Denorm, +Normalized, NaN]

Tiny Floating Point Example

8-bit Floating Point Representation

• the sign bit is in the most significant bit.
• the next four bits are the exponent, with a bias of 7.
• the last three bits are the frac

Same General Form as IEEE Format

• normalized, denormalized
• representation of 0, NaN, infinity

Bit layout: s (bit 7), exp (bits 6 to 3), frac (bits 2 to 0)

Values Related to the Exponent

Exp   exp    E     $2^E$
0     0000   -6    1/64    (denorms)
1     0001   -6    1/64
2     0010   -5    1/32
…     (E = Exp – 7 for Exp from 1 to 14)
14    1110    7    128
15    1111   n/a           (inf, NaN)

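Dynamic Range

Denormalized numbers

s  exp   frac   E     Value
0  0000  001    -6    1/8 X 1/64 = 1/512     closest to zero
0  0000  111    -6    7/8 X 1/64 = 7/512     largest denorm

Normalized numbers

s  exp   frac   E     Value
0  0001  000    -6    8/8 X 1/64 = 8/512     smallest norm
0  0110  111    -1    15/8 X 1/2 = 15/16     closest to 1 below
0  0111  001     0    9/8 X 1 = 9/8          closest to 1 above
0  1110  111     7    15/8 X 128 = 240       largest norm
0  1111  000    n/a   inf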
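
The 8-bit format is small enough to decode exhaustively. The sketch below is an illustration with a made-up function name; it assumes the layout given for the tiny format (1 sign bit, 4 exp bits with bias 7, 3 frac bits) and the usual IEEE conventions for denormalized values (E = 1 – Bias, no implied leading 1) and for exp = 1111.

```c
#include <math.h>     /* INFINITY and NAN macros only */
#include <stdint.h>

/* Decode an 8-bit value laid out as s | exp (4 bits) | frac (3 bits). */
double tiny_fp_value(uint8_t bits) {
    int s    = (bits >> 7) & 1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double M;
    int E;

    if (exp == 0xF) {                  /* all ones: infinity or NaN */
        if (frac != 0) return NAN;
        return s ? -INFINITY : INFINITY;
    }
    if (exp == 0) {                    /* denormalized: no implied leading 1 */
        E = 1 - 7;
        M = frac / 8.0;
    } else {                           /* normalized: implied leading 1 */
        E = exp - 7;
        M = 1.0 + frac / 8.0;
    }

    double scale = 1.0;                /* scale = 2^E */
    for (int i = 0; i < E; i++) scale *= 2.0;
    for (int i = 0; i > E; i--) scale /= 2.0;

    return s ? -(M * scale) : M * scale;
}
```

For example, the pattern 0 1110 111 (0x77) decodes to 15/8 X 128 = 240, the largest normalized value, and 0 0000 001 (0x01) decodes to 1/512, the value closest to zero.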

Interesting Numbers

Description                 exp      frac     Numeric Value
Zero                        00…00    00…00    0.0
Smallest Pos. Denorm.       00…00    00…01    $2^{-\{23,52\}} \times 2^{-\{126,1022\}}$
  • Single $\approx 1.4 \times 10^{-45}$
  • Double $\approx 4.9 \times 10^{-324}$
Largest Denormalized        00…00    11…11    $(1.0 - \varepsilon) \times 2^{-\{126,1022\}}$
  • Single $\approx 1.18 \times 10^{-38}$
  • Double $\approx 2.2 \times 10^{-308}$
Smallest Pos. Normalized    00…01    00…00    $1.0 \times 2^{-\{126,1022\}}$
  • Just larger than largest denormalized
One                         01…11    00…00    1.0
Largest Normalized          11…10    11…11    $(2.0 - \varepsilon) \times 2^{\{127,1023\}}$
  • Single $\approx 3.4 \times 10^{38}$
  • Double $\approx 1.8 \times 10^{308}$
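
The single-precision entries can be cross-checked against the limits in <float.h>. A sketch, assuming float is a 32-bit IEEE 754 single; the enum constant names and the helper name are made up for illustration.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Single-precision bit patterns for the interesting numbers. */
enum {
    BITS_SMALLEST_DENORM = 0x00000001,   /* exp 00...00, frac 00...01 */
    BITS_LARGEST_DENORM  = 0x007FFFFF,   /* exp 00...00, frac 11...11 */
    BITS_SMALLEST_NORM   = 0x00800000,   /* exp 00...01, frac 00...00 */
    BITS_ONE             = 0x3F800000,   /* exp 01...11, frac 00...00 */
    BITS_LARGEST_NORM    = 0x7F7FFFFF    /* exp 11...10, frac 11...11 */
};

/* Reinterpret a 32-bit pattern as a float. */
float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under these assumptions, float_from_bits(BITS_SMALLEST_NORM) equals FLT_MIN and float_from_bits(BITS_LARGEST_NORM) equals FLT_MAX.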

Special Properties of Encoding

FP Zero Same as Integer Zero

• All bits = 0

Can (Almost) Use Unsigned Integer Comparison

• Must first compare sign bits
• Must consider -0 = 0
• NaNs problematic
  - Will be greater than any other values
  - What should comparison yield?
• Otherwise OK
  - Denorm vs. normalized
  - Normalized vs. infinity
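
The "almost" can be made concrete. The sketch below compares two floats using only their bit patterns; it is one possible way to handle the sign bit and the -0 = 0 case, not a method taken from the slides. The function names are made up, and it assumes 32-bit IEEE 754 floats and that neither argument is NaN.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bits to an unsigned key whose integer order matches
   numeric order (ignoring NaN and the two zeros). */
static uint32_t order_key(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    /* Negative: flip all bits so larger magnitudes sort lower.
       Non-negative: set the top bit so it sorts above every negative. */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

/* Returns 1 if a < b, using only integer operations on the bit patterns.
   Assumes neither argument is NaN. */
int float_less_by_bits(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    if (((ua | ub) << 1) == 0)
        return 0;                        /* both are zeros: -0 equals +0 */
    return order_key(a) < order_key(b);
}
```

Within one sign, the plain unsigned order of the bits already matches numeric order, including denormalized against normalized values and normalized values against infinity; the key only repairs the sign handling.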

Floating Point Operations

Conceptual View

• First compute exact result
• Make it fit into desired precision
  - Possibly overflow if exponent too large
  - Possibly round to fit into frac

Rounding Modes (illustrate with $ rounding)

                            $1.40   $1.60   $1.50   $2.50   –$1.50
• Zero                      $1      $1      $1      $2      –$1
• Round down (-∞)           $1      $1      $1      $2      –$2
• Round up (+∞)             $2      $2      $2      $3      –$1
• Nearest Even (default)    $1      $2      $2      $2      –$2

Note:

  1. Round down: rounded result is close to but no greater than true result.
  2. Round up: rounded result is close to but no less than true result.
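
The four modes can be sketched with integer arithmetic on amounts in cents. This is an illustration, not how hardware rounds: the enum and function names are made up, and the sample amounts used in the note after the code ($1.40, $1.60, $1.50, $2.50, -$1.50) are chosen for the example.

```c
/* Hypothetical names for the four rounding modes. */
enum round_mode { TOWARD_ZERO, ROUND_DOWN, ROUND_UP, NEAREST_EVEN };

/* Round an amount in cents to whole dollars under the given mode. */
long round_dollars(long cents, enum round_mode mode) {
    long q = cents / 100;                 /* C division truncates toward zero */
    long r = cents % 100;                 /* remainder has the sign of cents */
    long down = (r < 0) ? q - 1 : q;      /* largest integer <= true value */
    long up   = (r > 0) ? q + 1 : q;      /* smallest integer >= true value */

    switch (mode) {
    case TOWARD_ZERO:
        return q;
    case ROUND_DOWN:
        return down;
    case ROUND_UP:
        return up;
    case NEAREST_EVEN: {
        long twice = 2 * (cents - down * 100);   /* doubled distance above 'down' */
        if (twice < 100) return down;
        if (twice > 100) return up;
        return (down % 2 == 0) ? down : up;      /* exact tie: pick the even neighbor */
    }
    }
    return q;   /* unreachable for valid modes */
}
```

With these inputs, round_dollars(150, NEAREST_EVEN) gives 2 and round_dollars(250, NEAREST_EVEN) also gives 2, while round_dollars(-150, ROUND_DOWN) gives -2 and round_dollars(-150, TOWARD_ZERO) gives -1.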
Closer Look at Round-To-Even

Default Rounding Mode

• Historically hard to get any other kind without dropping into assembly (C99 added fesetround in <fenv.h>)
• All others are statistically biased
  - Sum of set of positive numbers will consistently be over- or under-estimated

Applying to Other Decimal Places / Bit Positions

• When exactly halfway between two possible values
  - Round so that least significant digit is even
• E.g., round to nearest hundredth
    1.2349999 → 1.23 (Less than half way)
    1.2350001 → 1.24 (Greater than half way)
    1.2350000 → 1.24 (Half way, round up)
    1.2450000 → 1.24 (Half way, round down)
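
The same rule applies at a bit position: in binary, "even" means the least significant kept bit is 0, and "half way" means the dropped bits are exactly 100…0. A sketch with a made-up function name, assuming 1 <= k < 32; the sample values in the note after the code are chosen for illustration and do not come from the slides.

```c
/* Drop the low k bits of x, rounding to nearest with ties to even.
   Assumes 1 <= k < 32. */
unsigned round_even_drop_bits(unsigned x, unsigned k) {
    unsigned q    = x >> k;                   /* kept bits */
    unsigned rem  = x & ((1u << k) - 1u);     /* dropped bits */
    unsigned half = 1u << (k - 1);            /* pattern 100...0 */

    if (rem > half || (rem == half && (q & 1u)))
        q += 1;                               /* above half, or a tie with odd q */
    return q;
}
```

Treating x as a value with five fractional bits and rounding to two (k = 3): 10.00011 becomes 10.00, 10.00110 becomes 10.01, the tie 10.11100 rounds up to 11.00, and the tie 10.10100 rounds down to 10.10.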

FP Addition

Operands

$$(-1)^{s_1} M_1\, 2^{E_1} \qquad (-1)^{s_2} M_2\, 2^{E_2}$$

• Assume E1 > E2

Exact Result

$$(-1)^{s} M\, 2^{E}$$

• Sign s, significand M:
  - Result of signed align & add
• Exponent E: E1

Fixing

• If M ≥ 2, shift M right, increment E
• if M < 1, shift M left k positions, decrement E by k
• Overflow if E out of range
• Round M to fit frac precision

[Figure: $(-1)^{s_1} M_1$ is added to $(-1)^{s_2} M_2$ shifted right by E1–E2 positions, giving $(-1)^{s} M$]
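
The align, add, and fix steps can be sketched on a toy representation. Everything here is an assumption made for illustration: the struct and function names are invented, operands are positive only (so the M < 1 case never arises), bits shifted out during alignment are truncated rather than rounded, and overflow of E is not checked.

```c
#include <stdint.h>

/* Hypothetical toy value: (m / 2^23) * 2^e, with m in [2^23, 2^24),
   so m / 2^23 plays the role of M in [1.0, 2.0). Positive values only. */
struct toy_fp { uint32_t m; int e; };

struct toy_fp toy_make(uint32_t m, int e) {
    struct toy_fp x = { m, e };
    return x;
}

/* Convert a toy value to double, for checking results. */
double toy_value(struct toy_fp x) {
    double v = (double)x.m / 8388608.0;       /* 8388608 = 2^23 */
    for (int i = 0; i < x.e; i++) v *= 2.0;
    for (int i = 0; i > x.e; i--) v /= 2.0;
    return v;
}

struct toy_fp toy_add(struct toy_fp a, struct toy_fp b) {
    if (a.e < b.e) {                          /* make a the operand with larger E */
        struct toy_fp t = a; a = b; b = t;
    }
    int shift = a.e - b.e;
    /* Align: shift the smaller operand right by E1 - E2 (truncating). */
    uint64_t m2 = (shift < 64) ? ((uint64_t)b.m >> shift) : 0;
    uint64_t m  = (uint64_t)a.m + m2;         /* add significands */
    int e = a.e;                              /* exponent of result: E1 */

    if (m >= (1u << 24)) {                    /* M >= 2: shift right, increment E */
        m >>= 1;
        e += 1;
    }
    return toy_make((uint32_t)m, e);
}
```

For example, adding 1.5 X 2^0 (m = 0xC00000) and 1.0 X 2^0 (m = 0x800000) gives a significand sum of 2.5, which is fixed to 1.25 X 2^1.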

Mathematical Properties of FP Add

Compare to those of Abelian Group

• Closed under addition? YES
  - But may generate infinity or NaN
• Commutative? YES
• Associative? NO
  - Overflow and inexactness of rounding
• 0 is additive identity? YES
• Every element has additive inverse? ALMOST
  - Except for infinities & NaNs

Monotonicity

• a ≥ b ⇒ a+c ≥ b+c? ALMOST
  - Except for infinities & NaNs
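
The failure of associativity is easy to observe in single precision. A minimal sketch, assuming IEEE 754 floats and the default rounding mode; the function names are made up, and the volatile temporaries force each partial sum to be rounded to float.

```c
/* Returns (a + b) + c, rounding each step to float. */
float add_left(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Returns a + (b + c), rounding each step to float. */
float add_right(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With a = 3.14f, b = 1e10f, c = -1e10f, add_left gives 0.0 because 3.14 is lost when rounded into a sum near 1e10, while add_right gives 3.14f because the two large terms cancel first.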

Math. Properties of FP Mult

Compare to Commutative Ring

• Closed under multiplication? YES
  - But may generate infinity or NaN
• Multiplication Commutative? YES
• Multiplication is Associative? NO
  - Possibility of overflow, inexactness of rounding
• 1 is multiplicative identity? YES
• Multiplication distributes over addition? NO
  - Possibility of overflow, inexactness of rounding

Monotonicity

• a ≥ b & c ≥ 0 ⇒ a * c ≥ b * c? ALMOST
  - Except for infinities & NaNs
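
Overflow makes both failures visible in single precision. A sketch, assuming IEEE 754 floats; all function names are made up, and the volatile temporaries force each intermediate result to be rounded to float.

```c
#include <float.h>

/* 1 if x is +infinity or -infinity. */
int is_inf_f(float x) {
    return x > FLT_MAX || x < -FLT_MAX;
}

/* (a * b) * c, rounding each step to float. */
float mul_left(float a, float b, float c) {
    volatile float t = a * b;
    volatile float r = t * c;
    return r;
}

/* a * (b * c), rounding each step to float. */
float mul_right(float a, float b, float c) {
    volatile float t = b * c;
    volatile float r = a * t;
    return r;
}

/* a * (b - c) */
float distribute_inside(float a, float b, float c) {
    volatile float t = b - c;
    volatile float r = a * t;
    return r;
}

/* a * b - a * c */
float distribute_outside(float a, float b, float c) {
    volatile float ab = a * b;
    volatile float ac = a * c;
    volatile float r = ab - ac;
    return r;
}
```

With a = b = 1e20f and c = 1e-20f, mul_left overflows to infinity while mul_right stays near 1e20. With a = b = c = 1e20f, distribute_inside gives 0.0 while distribute_outside gives NaN (infinity minus infinity).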

Floating Point in C

C Guarantees Two Levels

    float     single precision
    double    double precision

Conversions

• Casting between int, float, and double changes numeric values
• double or float to int
  - Truncates fractional part
  - Like rounding toward zero
  - Not defined when out of range
    » Hardware-dependent: x86 yields TMin; some machines saturate to TMin or TMax
• int to double
  - Exact conversion, as long as int has ≤ 53 bit word size
• int to float
  - Will round according to rounding mode
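
These conversion rules can be observed directly. A sketch with made-up function names, assuming a 32-bit int, IEEE 754 float and double, and the default rounding mode; out-of-range inputs to truncate_to_int are the caller's responsibility because that case is not defined.

```c
/* double to int: truncates the fractional part (rounds toward zero).
   The caller must ensure d is within the range of int. */
int truncate_to_int(double d) {
    return (int)d;
}

/* int to float: may round, since float has only 24 significant bits. */
float int_to_float(int x) {
    volatile float f = (float)x;
    return f;
}

/* int to double: exact for a 32-bit int, since double has 53 bits. */
double int_to_double(int x) {
    return (double)x;
}
```

For example, truncate_to_int(-1.9) gives -1, and int_to_float(16777217) gives 16777216.0f because 2^24 + 1 is a tie that rounds to the even neighbor.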

Summary

IEEE Floating Point Has Clear Mathematical Properties

• Represents numbers of form $M \times 2^{E}$
• Can reason about operations independent of implementation
  - As if computed with perfect precision and then rounded
• Not the same as real arithmetic
  - Violates associativity/distributivity
  - Makes life difficult for compilers & serious numerical applications programmers