Lecture Slides on Floating Point - Computer Systems and Program | CS 367, Study notes of Computer Science

George Mason University (GMU), Computer Science

Prof. Richard Carver

Material Type: Notes; Professor: Carver; Class: Computer Systems and Programm; Subject: Computer Science; University: George Mason University; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 02/12/2009

koofers-user-2r9 🇺🇸

Floating Point

Topics

IEEE Floating Point Standard

Rounding

Floating Point Operations

Mathematical properties

CS 367

Floating Point Puzzles

For each of the following C expressions, either:

Argue that it is true for all argument values

Explain why not true

•x == (int)(float) x

•x == (int)(double) x

•f == (float)(double) f

•d == (float) d

•f == -(-f);

•2/3 == 2/3.0

•d < 0.0 ⇒ ((d*2) < 0.0)

•d > f ⇒ -f > -d

•d * d >= 0.0

•(d+f)-d == f

int x = …;

float f = …;

double d = …;

Assume neither d nor f is NaN

IEEE Floating Point

IEEE Standard 754

• Established in 1985 as a uniform standard for floating point arithmetic

• Before that, many idiosyncratic formats

• Supported by all major CPUs

Driven by Numerical Concerns

• Nice standards for rounding, overflow, underflow

• Hard to make go fast

• Numerical analysts predominated over hardware types in defining the standard

Fractional Binary Numbers

Representation

• Bits to right of “binary point” represent fractional powers of 2

• Represents rational number:

b_i b_{i-1} \cdots b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3} \cdots b_{-j}

\sum_{k=-j}^{i} b_k \cdot 2^k
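
The sum can be evaluated mechanically. The following is a minimal sketch, assuming the number is given as a string of '0' and '1' characters with an optional '.'; the function name is hypothetical.

```c
#include <stddef.h>

/* Value of a binary string such as "101.11": bit b_k contributes b_k * 2^k. */
double binary_fraction_value(const char *bits) {
    double value = 0.0;
    double weight = 0.5;          /* weight of the first bit right of the point */
    int seen_point = 0;
    for (size_t n = 0; bits[n] != '\0'; n++) {
        char c = bits[n];
        if (c == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            value = value * 2.0 + (c - '0');   /* integer part: shift left, add bit */
        } else {
            value += (c - '0') * weight;       /* fractional part: 2^-1, 2^-2, ... */
            weight /= 2.0;
        }
    }
    return value;
}
```

For instance, binary_fraction_value("101.11") evaluates 4 + 1 + 1/2 + 1/4.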

Numerical Form

(-1)^s \cdot M \cdot 2^E

Sign bit s determines whether number is negative or positive

Significand M normally a fractional value in range [1.0,2.0).

Exponent E weights value by power of two

Encoding

• MSB is sign bit

• exp field encodes E

• frac field encodes M

Floating Point Representation

Field layout: s | exp | frac

Encoding

• MSB is sign bit

• exp field encodes E

• frac field encodes M

Sizes

• Single precision: 8 exp bits, 23 frac bits (32 bits total)

• Double precision: 11 exp bits, 52 frac bits (64 bits total)

• Extended precision: 15 exp bits, 63 frac bits

  Only found in Intel-compatible machines

  Stored in 80 bits (1 bit wasted)

Floating Point Precisions

Field layout: s | exp | frac
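
The three fields can be pulled out of a real value with shifts and masks. This sketch assumes float is the 32-bit single-precision format with the widths listed above; the struct and function names are illustrative only.

```c
#include <stdint.h>
#include <string.h>

struct fp_fields {
    uint32_t s;     /* 1 bit   */
    uint32_t exp;   /* 8 bits  */
    uint32_t frac;  /* 23 bits */
};

/* Split a single-precision value into its sign, exp, and frac fields. */
struct fp_fields split_float(float f) {
    uint32_t bits;
    struct fp_fields out;
    memcpy(&bits, &f, sizeof bits);   /* copy the bytes; no pointer-cast aliasing */
    out.s    = bits >> 31;
    out.exp  = (bits >> 23) & 0xFFu;
    out.frac = bits & 0x7FFFFFu;
    return out;
}
```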

“Normalized” Numeric Values

Condition

• exp ≠ 000…0 and exp ≠ 111…1

Exponent coded as biased value

E = Exp – Bias

• Exp: unsigned value denoted by exp

• Bias: bias value

  » Single precision: 127 (Exp: 1…254, E: -126…127)

  » Double precision: 1023 (Exp: 1…2046, E: -1022…1023)

  » In general: Bias = 2^{e-1} - 1, where e is the number of exponent bits

Significand coded with implied leading 1

M = 1.xxx…x_2

• xxx…x: bits of frac

• Minimum when 000…0 (M = 1.0)

• Maximum when 111…1 (M = 2.0 – ε)

• Get extra leading bit for “free”
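
The two rules (biased exponent, implied leading 1) are enough to turn a normalized single-precision encoding back into a number. A minimal sketch, with hypothetical helper names, that handles only the normalized case (1 ≤ Exp ≤ 254):

```c
#include <stdint.h>

/* 2^e by repeated exact doubling or halving. */
static double pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Bias = 2^(e-1) - 1 for an exponent field of e bits. */
int bias_for(int exp_bits) {
    return (1 << (exp_bits - 1)) - 1;
}

/* Value of a normalized single-precision encoding. */
double normalized_single_value(uint32_t s, uint32_t exp, uint32_t frac) {
    int E = (int)exp - bias_for(8);              /* E = Exp - Bias */
    double M = 1.0 + (double)frac / 8388608.0;   /* implied 1 plus frac / 2^23 */
    double v = M * pow2(E);
    return s ? -v : v;
}
```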

Normalized Encoding Example

Value

float F = 15213.0;

15213_{10} = 11101101101101_2 = 1.1101101101101_2 × 2^{13}

Significand

M = 1.1101101101101_2

frac = 11011011011010000000000_2

Exponent

E = 13

Bias = 127

Exp = 140 = 10001100_2

Floating Point Representation (Class 02):

Hex: 4 6 6 D B 4 0 0

Binary: 0100 0110 0110 1101 1011 0100 0000 0000
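
The same steps can be carried out by hand in code and compared with what the machine stores. This is a sketch for positive integers below 2^24 only (so no rounding is needed), assuming float is IEEE single precision; both function names are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hand-encode an integer 0 < x < 2^24 as single-precision bits (sign = 0). */
uint32_t encode_small_int(uint32_t x) {
    int E = 0;
    while ((x >> (E + 1)) != 0) E++;              /* position of the leading 1 */
    uint32_t frac = (x ^ (1u << E)) << (23 - E);  /* drop leading 1, left-align */
    uint32_t exp = (uint32_t)(E + 127);           /* add the bias */
    return (exp << 23) | frac;
}

/* Bits the hardware actually stores, for comparison. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Calling encode_small_int(15213) walks through E = 13, Exp = 140, and the 23-bit frac shown above.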

Summary of Floating Point

Real Number Encodings

[Figure: number line showing the encoding regions -Normalized, -Denorm, +Denorm, +Normalized, and NaN]

Tiny Floating Point Example

8-bit Floating Point Representation

• The sign bit is in the most significant bit.

• The next four bits are the exponent, with a bias of 7.

• The last three bits are the frac.

Same General Form as IEEE Format

• Normalized, denormalized

• Representation of 0, NaN, infinity

Bit layout: s (bit 7) | exp (bits 6–3) | frac (bits 2–0)

Values Related to the Exponent

Exp   exp    E     2^E
0     0000   -6    1/64   (denorms)
15    1111   n/a          (inf, NaN)
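
Only the first and last rows of the table survive in this preview. The rows in between follow from E = Exp – Bias with a bias of 7, which the sketch below prints; it assumes the usual IEEE convention that denorms use E = 1 – Bias, and the function names are hypothetical.

```c
#include <stdio.h>

/* Exponent E for a 4-bit exp field with bias 7.
   Exp = 0 (denorms) uses E = 1 - Bias; Exp = 15 is reserved for inf/NaN. */
int tiny_exponent(unsigned exp_field) {
    const int bias = 7;
    return exp_field == 0 ? 1 - bias : (int)exp_field - bias;
}

/* Print one row per exponent field value: Exp, its four bits, and E. */
void print_tiny_exponent_table(void) {
    for (unsigned e = 0; e <= 15; e++) {
        printf("%2u  %u%u%u%u  ", e, (e >> 3) & 1, (e >> 2) & 1, (e >> 1) & 1, e & 1);
        if (e == 15)
            printf("n/a (inf, NaN)\n");
        else
            printf("%d%s\n", tiny_exponent(e), e == 0 ? " (denorms)" : "");
    }
}
```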

Dynamic Range

s  exp   frac  E    Value
0  1111  000   n/a  inf

Rows annotated in the table: closest to zero, largest denorm, smallest norm, closest to 1 below, closest to 1 above, largest norm. The rows are grouped into denormalized numbers and normalized numbers.
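
The annotated rows can be recomputed by decoding the 8-bit patterns. Below is a minimal decoder for the tiny format (1 sign bit, 4 exp bits with bias 7, 3 frac bits); tiny_value and pow2 are illustrative names, not part of the slides.

```c
#include <math.h>   /* only for the INFINITY and NAN macros */

/* 2^e by repeated exact doubling or halving. */
static double pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Decode an 8-bit value laid out as s (1) | exp (4, bias 7) | frac (3). */
double tiny_value(unsigned char b) {
    unsigned s = (b >> 7) & 1, exp = (b >> 3) & 0xF, frac = b & 0x7;
    double v;
    if (exp == 0)            /* denormalized: M = 0.frac, E = 1 - 7 */
        v = (frac / 8.0) * pow2(-6);
    else if (exp == 15)      /* special: infinity or NaN */
        v = frac == 0 ? INFINITY : NAN;
    else                     /* normalized: M = 1.frac, E = exp - 7 */
        v = (1.0 + frac / 8.0) * pow2((int)exp - 7);
    return s ? -v : v;
}
```

Under these assumptions the pattern 0 0000 001 is the value closest to zero (1/512), 0 0000 111 the largest denorm (7/512), 0 0001 000 the smallest norm (8/512), and 0 1110 111 the largest norm (240).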

Interesting Numbers

Description               exp     frac    Numeric Value
Zero                      00…00   00…00   0.0
Smallest Pos. Denorm.     00…00   00…01   2^{-{23,52}} × 2^{-{126,1022}}
                                          Single ≈ 1.4 × 10^{-45}
                                          Double ≈ 4.9 × 10^{-324}
Largest Denormalized      00…00   11…11   (1.0 – ε) × 2^{-{126,1022}}
                                          Single ≈ 1.18 × 10^{-38}
                                          Double ≈ 2.2 × 10^{-308}
Smallest Pos. Normalized  00…01   00…00   1.0 × 2^{-{126,1022}}
                                          Just larger than largest denormalized
One                       01…11   00…00   1.0
Largest Normalized        11…10   11…11   (2.0 – ε) × 2^{{127,1023}}
                                          Single ≈ 3.4 × 10^{38}
                                          Double ≈ 1.8 × 10^{308}

Special Properties of Encoding

FP Zero Same as Integer Zero

• All bits = 0

Can (Almost) Use Unsigned Integer Comparison

• Must first compare sign bits

• Must consider -0 = 0

• NaNs problematic

  Will be greater than any other values

  What should comparison yield?

• Otherwise OK

  Denorm vs. normalized

  Normalized vs. infinity

Floating Point Operations

Conceptual View

• First compute exact result

• Make it fit into desired precision

  Possibly overflow if exponent too large

  Possibly round to fit into frac

Rounding Modes (illustrate with $ rounding)

                            $1.40  $1.60  $1.50  $2.50  –$1.50
• Zero                      $1     $1     $1     $2     –$1
• Round down (-∞)           $1     $1     $1     $2     –$2
• Round up (+∞)             $2     $2     $2     $3     –$1
• Nearest Even (default)    $1     $2     $2     $2     –$2

Note:

Round down: rounded result is close to but no greater than true result.
Round up: rounded result is close to but no less than true result.
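
The four modes can be written out for whole-dollar rounding. This is a sketch using integer casts rather than the hardware rounding mode; the function names and the sample amounts ($1.40, $1.60, $1.50, $2.50, –$1.50) are assumptions chosen for illustration.

```c
/* Four rounding modes applied to a dollar amount, rounding to whole dollars.
   Assumes |x| is far below LONG_MAX so the casts are well defined. */
long round_toward_zero(double x) {
    return (long)x;                      /* a cast truncates toward zero */
}

long round_down(double x) {              /* toward -infinity */
    long t = (long)x;
    return (double)t > x ? t - 1 : t;
}

long round_up(double x) {                /* toward +infinity */
    long t = (long)x;
    return (double)t < x ? t + 1 : t;
}

long round_nearest_even(double x) {
    long lo = round_down(x);
    double diff = x - (double)lo;        /* in [0, 1) */
    if (diff < 0.5) return lo;
    if (diff > 0.5) return lo + 1;
    return (lo % 2 == 0) ? lo : lo + 1;  /* tie: pick the even neighbor */
}
```

With these definitions, 2.50 rounds to 2 under nearest-even but to 3 under round up, and -1.50 rounds to -1 toward zero but to -2 under round down.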

Closer Look at Round-To-Even

Default Rounding Mode

• Hard to get any other kind without dropping into assembly (where supported, C99's fesetround() in <fenv.h> is the portable alternative)

• All others are statistically biased

  Sum of set of positive numbers will consistently be over- or under-estimated

Applying to Other Decimal Places / Bit Positions

• When exactly halfway between two possible values

  Round so that least significant digit is even

• E.g., round to nearest hundredth

  1.2349999 → 1.23 (Less than half way)
  1.2350001 → 1.24 (Greater than half way)
  1.2350000 → 1.24 (Half way, round up)
  1.2450000 → 1.24 (Half way, round down)
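
The hundredths example can be reproduced exactly with integers, which avoids the fact that values such as 1.235 are not exactly representable in binary. In this sketch the input is assumed to be a non-negative amount scaled by 10^7; the function name is hypothetical.

```c
/* Round to the nearest hundredth, ties to even, using integer arithmetic.
   Input: amount scaled by 10^7 (1.2349999 is passed as 12349999).
   Output: the result in hundredths (123 means 1.23). */
long round_hundredths_even(long long scaled) {
    const long long unit = 100000;           /* one hundredth, in 10^-7 units */
    long long q = scaled / unit;             /* candidate, rounded down */
    long long r = scaled % unit;             /* the part being dropped */
    if (r > unit / 2) return (long)(q + 1);  /* more than half way */
    if (r < unit / 2) return (long)q;        /* less than half way */
    return (long)((q % 2 == 0) ? q : q + 1); /* half way: keep the last digit even */
}
```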

FP Addition

Operands

(-1)^{s1} M1 2^{E1}

(-1)^{s2} M2 2^{E2}

• Assume E1 > E2

Exact Result

(-1)^s M 2^E

• Sign s, significand M:

  Result of signed align & add

• Exponent E: E1

Fixing

• If M ≥ 2, shift M right, increment E

• If M < 1, shift M left k positions, decrement E by k

• Overflow if E out of range

• Round M to fit frac precision

[Figure: (-1)^{s2} M2 is shifted right by E1 – E2 to align with (-1)^{s1} M1, and the two are added to give (-1)^s M]
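
The align, add, and fix steps can be sketched for the simplest case. The code below is an illustration under strong assumptions: both operands are positive and normalized, the significand is held as a 24-bit integer, and bits shifted out during alignment are truncated rather than rounded. The struct and function names are invented for the example.

```c
#include <stdint.h>

/* A positive value M * 2^E with a 24-bit integer significand:
   bit 23 is the leading 1, so the real significand is m / 2^23, in [1, 2). */
struct fp_sketch {
    uint32_t m;
    int e;
};

/* Align, add, and fix up two positive operands (truncating, not rounding). */
struct fp_sketch add_positive(struct fp_sketch a, struct fp_sketch b) {
    if (a.e < b.e) { struct fp_sketch t = a; a = b; b = t; }  /* ensure E1 >= E2 */
    int shift = a.e - b.e;
    uint32_t m2 = shift < 32 ? b.m >> shift : 0;  /* align M2 to exponent E1 */
    struct fp_sketch r = { a.m + m2, a.e };       /* exponent of the sum is E1 */
    if (r.m >= (1u << 24)) {                      /* M >= 2: shift right, increment E */
        r.m >>= 1;
        r.e += 1;
    }
    return r;
}
```

Adding 1.5 (m = 0xC00000, e = 0) and 1.0 (m = 0x800000, e = 0) gives m = 0xA00000, e = 1, that is 1.25 × 2.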

Mathematical Properties of FP Add

Compare to those of Abelian Group

• Closed under addition? YES

  But may generate infinity or NaN

• Commutative? YES

• Associative? NO

  Overflow and inexactness of rounding

• 0 is additive identity? YES

• Every element has additive inverse? ALMOST

  Except for infinities & NaNs

Monotonicity

• a ≥ b ⇒ a + c ≥ b + c? ALMOST

  Except for infinities & NaNs
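
The failure of associativity is easy to trigger in single precision. In this sketch the operand values (3.14, 1e10, -1e10) are chosen for illustration, and the volatile temporaries force each intermediate result to be rounded to float.

```c
/* (a + b) + c, with each intermediate rounded to single precision. */
float add_left_first(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* a + (b + c) */
float add_right_first(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With a = 3.14f, b = 1e10f, c = -1e10f, the first grouping loses the 3.14 when it is added to 1e10 (whose neighbors are 1024 apart), while the second grouping cancels the large values first and keeps it.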

Math. Properties of FP Mult

Compare to Commutative Ring

• Closed under multiplication? YES

  But may generate infinity or NaN

• Multiplication Commutative? YES

• Multiplication is Associative? NO

  Possibility of overflow, inexactness of rounding

• 1 is multiplicative identity? YES

• Multiplication distributes over addition? NO

  Possibility of overflow, inexactness of rounding

Monotonicity

• a ≥ b & c ≥ 0 ⇒ a * c ≥ b * c? ALMOST

  Except for infinities & NaNs

Floating Point in C

C Guarantees Two Levels

float: single precision

double: double precision

Conversions

• Casting between int, float, and double changes numeric values

• double or float to int

  Truncates fractional part

  Like rounding toward zero

  Not defined when out of range

  » Generally saturates to TMin or TMax

• int to double

  Exact conversion, as long as int has ≤ 53 bit word size

• int to float

  Will round according to rounding mode
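
The in-range conversion rules can be observed with a few casts. This sketch assumes a 32-bit int, IEEE single and double precision, and the default round-to-nearest-even mode; the wrapper names are hypothetical.

```c
/* double -> int truncates toward zero (for values that are in range). */
int truncate_to_int(double d) {
    return (int)d;
}

/* int -> float rounds once the value needs more than 24 significant bits. */
float int_to_float(int x) {
    volatile float f = (float)x;   /* volatile forces a real single-precision store */
    return f;
}

/* int -> double is exact for a 32-bit int. */
double int_to_double(int x) {
    return (double)x;
}
```

For example, truncate_to_int(-2.9) gives -2, and int_to_float(16777217) gives 16777216.0f because 2^24 + 1 lies exactly half way between two representable values and the tie goes to the even one.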

Summary

IEEE Floating Point Has Clear Mathematical Properties

• Represents numbers of form M × 2^E

• Can reason about operations independent of implementation

  As if computed with perfect precision and then rounded

• Not the same as real arithmetic

  Violates associativity/distributivity

  Makes life difficult for compilers & serious numerical applications programmers
