A Tutorial on Data Representation: Integers, Floating-Point Numbers, and Characters (Exercises, Elements of Computer Science)

Fundamentals of Computer Science

Type: Exercises

2014/2015

Uploaded on 04/04/2015

il4r1ett41

10/10/13 A Tutorial on Data Representation - Integers, Floating-point Numbers, and Characters
www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html
TA BL E O F C O N TE N TS (H I DE)
yet another insignifi cant programming notes... | HOME
A Tutorial on Data
Representation
Integers, Floatingpoint
Numbers, and Characters
1. Numb er Sys tems
Humanbeings usedecimal (base10) and duodecimal(base 12) numbersystemsfor
countingandmeasurements(probablybecausewehave10fingersandtwobigtoes).
Computersusebinary(base 2)numbersystem,as theyare made frombinarydigita l
components(knownastransistors)operatingintwostatesonandoff.Incomputing,
wealsousehe xadecimal(base16) or octal (base8) numbersystems,as a compact
formforrepresentbinarynumbers.
1. 1 Deci mal( B ase10) Nu mberS y s tem
Decimalnumbersystemhastensymbols:0,1,2,3,4,5,6,7,8,and9,calleddigits.It
usespositionalnotation. That is,the leastsignificantdigit (rightmostdigit) isofthe
orderof 10^0 (units orones),t hesecond rightmostdigit is of the order of10^1
(tens),the thirdr ightmostdigit is ofthe order of 10^2 (hundreds),and soon. For
example,
735=7×10^2+3×10^1+5×10^0
WeshalldenoteadecimalnumberwithanoptionalsuffixDifambiguityarises.
1. 2 B ina ry(Bas e2 ) Numbe rS y ste m
Binary numbersystem hast wosymbols:0 and 1, called bits. It is alsoa positional
notation,forexample,
10110B=1×2^4+0×2^3+1×2^2+1×2^1+0×2^0
We shall denote a binary number with a suffix B. Some programming languages
denotebinarynumberswith prefix0b(e. g., 0b1001000),or prefix b with the bits
quoted(e.g.,b'10001111').
A binary digit is called a bit. Eight bits is called a byte (why 8bit unit? Probably
because8=23).
1. 3 Hex adec ima l( Bas e 16 ) Numb e rS yste m
1.NumberSystems
1.1Decimal(Base10)NumberSystem
1.2Binary(Base2)NumberSystem
1.3Hexadecimal(Base16)NumberSystem
1.4ConversionfromHexadecimaltoBinary
1.5ConversionfromBinarytoHexadecimal
1.6ConversionfromBasertoDecimal(Base10)
1.7ConversionfromDecimal(Base10)toBase
1.8Exercises(NumberSystemsConversion)
2.ComputerMemory&DataRepresentation
3.IntegerRepresentation
3.1nbitUnsignedIntegers
3.2SignedIntegers
3.3nbitSignIntegersinSignMagnitudeRepresentation
3.4nbitSignIntegersin1'sComplementRepresentation
3.5nbitSignIntegersin2'sComplementRepresentation
3.6Computersuse2'sComplementRepresentationforSignedIntegers
3.7Rangeofnbit2'sComplementSignedIntegers
3.8Decoding2'sComplementNumbers
3.9BigEndianvs.LittleEndian
3.10Exercise(IntegerRepresentation)
4.FloatingPointNumberRepresentation
4.1IEEE75432bitSinglePrecisionFloatingPointNumbers
4.2Exercises(FloatingpointNumbers)
4.3IEEE75464bitDoublePrecisionFloatingPointNumbers
4.4MoreonFloatingPointRepresentation
5.CharacterEncoding
5.17bitASCIICode(akaUSASCII,ISO/IEC646,ITUTT.50)
5.28bitLatin1(akaISO/IEC88591)
5.3Other8bitExtensionofUSASCII(ASCIIExtensions)
5.4Unicode(akaISO/IEC10646UniversalCharacterSet)
5.5UTF8(UnicodeTransformationFormat8bit)
5.6UTF16(UnicodeTransformationFormat16bit)
5.7UTF32(UnicodeTransformationFormat32bit)
5.8FormatsofMultiByte(e.g.,Unicode)TextFiles
5.9FormatsofTextFiles
5.10Windows'CMDCodepage
5.11ChineseCharacterSets
5.12CollatingSequences(forRankingCharacters)
5.13ForJavaProgrammersjava.nio.Charset
5.14ForJavaProgrammerscharand
5.15DisplayingHexValues&HexEditors
6.SummaryWhyBotheraboutDataRepresentation?
6.1Exercises(DataRepresentation)

A Tutorial on Data Representation

Integers, Floating-point Numbers, and Characters

1. Number Systems

Human beings use decimal (base 10) and duodecimal (base 12) number systems for counting and measurements (probably because we have 10 fingers and two big toes). Computers use the binary (base 2) number system, as they are made from binary digital components (known as transistors) operating in two states: on and off. In computing, we also use hexadecimal (base 16) or octal (base 8) number systems as a compact form for representing binary numbers.

1.1 Decimal (Base 10) Number System

The decimal number system has ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, called digits. It uses positional notation. That is, the least-significant digit (right-most digit) is of the order of 10^0 (units or ones), the second right-most digit is of the order of 10^1 (tens), the third right-most digit is of the order of 10^2 (hundreds), and so on. For example,

735 = 7×10^2 + 3×10^1 + 5×10^0

We shall denote a decimal number with an optional suffix D if ambiguity arises.

1.2 Binary (Base 2) Number System

The binary number system has two symbols: 0 and 1, called bits. It is also a positional notation, for example,

10110B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0

We shall denote a binary number with a suffix B. Some programming languages denote binary numbers with prefix 0b (e.g., 0b1001000), or prefix b with the bits quoted (e.g., b'10001111').

A binary digit is called a bit. Eight bits are called a byte (why an 8-bit unit? Probably because 8 = 2^3).

1.3 Hexadecimal (Base 16) Number System


The hexadecimal number system uses 16 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F, called hex digits. It is a positional notation, for example,

A3EH = 10×16^2 + 3×16^1 + 14×16^0

We shall denote a hexadecimal number (in short, hex) with a suffix H. Some programming languages denote hex numbers with prefix 0x (e.g., 0x1A3C5F), or prefix x with the hex digits quoted (e.g., x'C3A4D98B').

Each hexadecimal digit is also called a hex digit. Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F'.

Computers use the binary system in their internal operations, as they are built from binary digital electronic components. However, writing or reading a long sequence of binary bits is cumbersome and error-prone. The hexadecimal system is used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:

0H (0000B) (0D)
1H (0001B) (1D)
2H (0010B) (2D)
3H (0011B) (3D)
4H (0100B) (4D)
5H (0101B) (5D)
6H (0110B) (6D)
7H (0111B) (7D)
8H (1000B) (8D)
9H (1001B) (9D)
AH (1010B) (10D)
BH (1011B) (11D)
CH (1100B) (12D)
DH (1101B) (13D)
EH (1110B) (14D)
FH (1111B) (15D)
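The one-hex-digit-to-four-bits correspondence in the table can be tried out with a short Java sketch. The class and method names (`HexBits`, `nibble`, `hexToBinary`) are made up for this illustration; only standard JDK calls are used.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class HexBits {
    // Returns the 4-bit binary pattern for one hex digit, e.g. 'A' -> "1010".
    public static String nibble(char hexDigit) {
        int value = Character.digit(hexDigit, 16);     // 0..15, or -1 if not a hex digit
        if (value < 0) {
            throw new IllegalArgumentException("Not a hex digit: " + hexDigit);
        }
        String bits = Integer.toBinaryString(value);   // no leading zeros
        return "0000".substring(bits.length()) + bits; // left-pad to 4 bits
    }

    // Expands a hex string digit by digit, separating the 4-bit groups with spaces.
    public static String hexToBinary(String hex) {
        StringBuilder sb = new StringBuilder();
        for (char c : hex.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(nibble(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nibble('F'));         // 1111
        System.out.println(hexToBinary("A3C5")); // 1010 0011 1100 0101
    }
}
```

Because `Character.digit` accepts both cases, lowercase input such as "a3c5" gives the same bit groups.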

1.4 Conversion from Hexadecimal to Binary

Replace each hex digit by the 4 equivalent bits, for example,

A3C5H = 1010 0011 1100 0101B
102AH = 0001 0000 0010 1010B

1.5 Conversion from Binary to Hexadecimal

Starting from the right-most bit (least-significant bit), replace each group of 4 bits by the equivalent hex digit (pad the left-most bits with zeros if necessary), for example,

1001001010B = 0010 0100 1010B = 24AH
10001011001011B = 0010 0010 1100 1011B = 22CBH

It is important to note that a hexadecimal number provides a compact form or shorthand for representing binary bits.

1.6 Conversion from Base r to Decimal (Base 10)

Given an n-digit base r number: d_(n-1) d_(n-2) d_(n-3) ... d_3 d_2 d_1 d_0 (base r), the decimal equivalent is given by:

d_(n-1)×r^(n-1) + d_(n-2)×r^(n-2) + ... + d_1×r^1 + d_0×r^0

1.7 Conversion from Decimal (Base 10) to Base r

Use repeated division/remainder. For example, to convert 261D to hexadecimal:

261/16  quotient=16  remainder=5
16/16   quotient=1   remainder=0
1/16    quotient=0   remainder=1  (quotient=0 stop)

Reading the remainders from the last to the first gives the digits from most-significant to least-significant. Hence, 261D = 105H

1.8 Exercises (Number Systems Conversion)

  1. Convert the following decimal numbers into binary and hexadecimal numbers:
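For checking answers to conversion exercises of this kind, the repeated-division method and the positional expansion can be sketched in Java as follows. This is a minimal illustration for non-negative values and bases 2 to 16; the names `BaseConversion`, `toBase` and `toDecimal` are assumptions of this sketch.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class BaseConversion {
    private static final String DIGITS = "0123456789ABCDEF";

    // Decimal -> base r by repeated division. Remainders come out
    // least-significant digit first, so the result is reversed at the end.
    public static String toBase(int value, int r) {
        if (value < 0 || r < 2 || r > 16) {
            throw new IllegalArgumentException("need value >= 0 and 2 <= r <= 16");
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(DIGITS.charAt(value % r)); // remainder is the next digit
            value /= r;                          // quotient feeds the next step
        }
        return sb.reverse().toString();
    }

    // Base r -> decimal by positional expansion, evaluated left to right:
    // ((d_(n-1) * r + d_(n-2)) * r + ...) * r + d_0.
    public static int toDecimal(String digits, int r) {
        int value = 0;
        for (char c : digits.toCharArray()) {
            int d = Character.digit(c, r);
            if (d < 0) {
                throw new IllegalArgumentException("Bad digit for base " + r + ": " + c);
            }
            value = value * r + d;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toBase(261, 16));       // 105
        System.out.println(toDecimal("10110", 2)); // 22
        System.out.println(toDecimal("A3E", 16));  // 2622
    }
}
```

The JDK already offers `Integer.toString(value, radix)` and `Integer.parseInt(text, radix)` for the same job; the hand-written loops are shown only to mirror the manual method.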

Reference and images: Wikipedia.

3. Integer Representation

Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-significant bit. They contrast with real numbers or floating-point numbers, where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representations and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.

Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation schemes for integers:

  1. Unsigned Integers: can represent zero and positive integers.
  2. Signed Integers: can represent zero, positive and negative integers. Three representation schemes have been proposed for signed integers:
     a. Sign-Magnitude representation
     b. 1's Complement representation
     c. 2's Complement representation

You, as the programmer, need to decide on the bit-length and representation scheme for your integers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200; you might choose the 8-bit unsigned integer scheme, as there are no negative numbers involved.

3.1 n-bit Unsigned Integers

Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".

Example 1: Suppose that n=8 and the binary pattern is 0100 0001B. The value of this unsigned integer is 1×2^0 + 1×2^6 = 65D.

Example 2: Suppose that n=16 and the binary pattern is 0001 0000 0000 1000B. The value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D.

Example 3: Suppose that n=16 and the binary pattern is 0000 0000 0000 0000B. The value of this unsigned integer is 0.

An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer can represent integers from 0 to (2^n)-1, as tabulated below:

n    Minimum  Maximum
8    0        (2^8)-1  (=255)
16   0        (2^16)-1 (=65,535)
32   0        (2^32)-1 (=4,294,967,295) (9+ digits)
64   0        (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)
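The unsigned examples and the maxima in the table can be reproduced with a small Java sketch. Java's integer types are signed, so the sketch leans on the JDK's unsigned helper methods; the class and method names (`UnsignedDemo`, `unsignedValue`, `maxUnsigned`) are invented for this illustration.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class UnsignedDemo {
    // Value of a bit pattern read as an unsigned integer. Spaces are ignored.
    // Limited to 63 bits here so the result fits in a non-negative long.
    public static long unsignedValue(String bits) {
        return Long.parseLong(bits.replace(" ", ""), 2);
    }

    // Largest n-bit unsigned integer, (2^n)-1, as a decimal string (1 <= n <= 64).
    public static String maxUnsigned(int n) {
        if (n < 1 || n > 64) {
            throw new IllegalArgumentException("n must be in 1..64");
        }
        long allOnes = (n == 64) ? -1L : (1L << n) - 1; // n one-bits
        return Long.toUnsignedString(allOnes);
    }

    public static void main(String[] args) {
        System.out.println(unsignedValue("0100 0001"));           // 65
        System.out.println(unsignedValue("0001 0000 0000 1000")); // 4104
        System.out.println(maxUnsigned(8));                       // 255
        System.out.println(maxUnsigned(64));                      // 18446744073709551615
        // A Java byte is signed; Byte.toUnsignedInt reads the same 8 bits as unsigned.
        System.out.println(Byte.toUnsignedInt((byte) 0xFF));      // 255
    }
}
```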

3.2 Signed Integers

Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes are available for signed integers:

  1. Sign-Magnitude representation
  2. 1's Complement representation
  3. 2's Complement representation

In all the above three schemes, the most-significant bit (msb) is called the sign bit. The sign bit is used to represent the sign of the integer, with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in different schemes.

3.3 n-bit Sign Integers in Sign-Magnitude Representation

In sign-magnitude representation:

The most-significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
The remaining n-1 bits represent the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the (n-1)-bit binary pattern".

Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 → positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 → negative
Absolute value is 000 0001B = 1D
Hence, the integer is -1D

Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 → positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation is 1 000 0000B.
Sign bit is 1 → negative
Absolute value is 000 0000B = 0D
Hence, the integer is -0D
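The sign-magnitude decoding rule used in these examples can be sketched in Java for n=8. The class name `SignMagnitude` and the method `decode` are illustrative assumptions; Java itself does not use this representation for its integer types.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class SignMagnitude {
    // Decodes an 8-bit sign-magnitude pattern (only the low 8 bits are used).
    public static int decode(int pattern) {
        int sign = (pattern >> 7) & 1;  // most-significant bit of the byte
        int magnitude = pattern & 0x7F; // remaining 7 bits
        return (sign == 0) ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        System.out.println(decode(0b0100_0001)); // 65
        System.out.println(decode(0b1000_0001)); // -1
        // 0000 0000B and 1000 0000B stand for +0 and -0.
        // A Java int has a single zero, so both decode to 0.
        System.out.println(decode(0b0000_0000)); // 0
        System.out.println(decode(0b1000_0000)); // 0
    }
}
```

The last two calls show the two zero patterns collapsing to a single value, which is one reason this scheme is awkward to work with.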

3.5 n-bit Sign Integers in 2's Complement Representation

In 2's complement representation:

Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
The remaining n-1 bits represent the magnitude of the integer, as follows:
for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the (n-1)-bit binary pattern plus one" (hence called 2's complement).

Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 → positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 → negative
Absolute value is the complement of 000 0001B plus 1, i.e., 111 1110B + 1B = 127D
Hence, the integer is -127D

Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 → positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 → negative
Absolute value is the complement of 111 1111B plus 1, i.e., 000 0000B + 1B = 1D
Hence, the integer is -1D
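The "complement the remaining bits and add one" rule can be written out in Java for n=8 and compared with Java's own `byte` type, which is an 8-bit 2's complement integer. The class name `TwosComplement` and method `decode` are hypothetical names for this sketch.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class TwosComplement {
    // Decodes an 8-bit 2's complement pattern (only the low 8 bits are used),
    // following the rule stated in the text.
    public static int decode(int pattern) {
        int sign = (pattern >> 7) & 1;
        int low7 = pattern & 0x7F;
        if (sign == 0) {
            return low7;                      // positive: magnitude of the 7 bits
        }
        int magnitude = ((~low7) & 0x7F) + 1; // complement the 7 bits, then add 1
        return -magnitude;
    }

    public static void main(String[] args) {
        System.out.println(decode(0b0100_0001)); // 65
        System.out.println(decode(0b1000_0001)); // -127
        System.out.println(decode(0b1111_1111)); // -1
        System.out.println(decode(0b1000_0000)); // -128
        // Casting to byte reinterprets the same 8 bits as 2's complement.
        System.out.println((byte) 0b1000_0001);  // -127
    }
}
```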

3.6 Computers use 2's Complement Representation for Signed Integers

We have discussed three representations for signed integers: sign-magnitude, 1's complement and 2's complement. Computers use 2's complement in representing signed integers. This is because:

  1. There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
  2. Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic".

Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D + 5D = 70D

65D → 0100 0001B
 5D → 0000 0101B (+
      0100 0110B → 70D (OK)

Example 2: Subtraction is treated as Addition of a Positive and a Negative Integer: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D

65D → 0100 0001B
-5D → 1111 1011B (+
      0011 1100B → 60D (discard carry - OK)

Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D

-65D → 1011 1111B
 -5D → 1111 1011B (+
       1011 1010B → -70D (discard carry - OK)
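The three additions can be replayed in Java with one masked addition serving all cases: the same adder handles positive and negative operands, and the carry out of bit 7 is simply dropped. The names `EightBitAdd`, `add8` and `bits` are assumptions made for this sketch.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class EightBitAdd {
    // Adds two 8-bit patterns; masking with 0xFF discards the carry out of bit 7.
    public static int add8(int a, int b) {
        return (a + b) & 0xFF;
    }

    // Formats the low 8 bits of a pattern as "bbbb bbbb".
    public static String bits(int pattern) {
        // OR-ing in bit 8 forces a 9-character string; dropping it keeps leading zeros.
        String s = Integer.toBinaryString((pattern & 0xFF) | 0x100).substring(1);
        return s.substring(0, 4) + " " + s.substring(4);
    }

    public static void main(String[] args) {
        int sum = add8(0b0100_0001, 0b0000_0101);           // 65 + 5
        System.out.println(bits(sum) + " = " + (byte) sum); // 0100 0110 = 70
        sum = add8(0b0100_0001, 0b1111_1011);               // 65 + (-5)
        System.out.println(bits(sum) + " = " + (byte) sum); // 0011 1100 = 60
        sum = add8(0b1011_1111, 0b1111_1011);               // (-65) + (-5)
        System.out.println(bits(sum) + " = " + (byte) sum); // 1011 1010 = -70
    }
}
```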

  1. If S=0, the number is positive and its absolute value is the binary value of the remaining n-1 bits.
  2. If S=1, the number is negative. You could "invert the n-1 bits and plus 1" to get the absolute value of the negative number. Alternatively, you could scan the remaining n-1 bits from the right (least-significant bit). Look for the first occurrence of 1. Flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example,

n = 8, bit pattern = 1 100 0100B
S = 1 → negative
Scanning from the right and flipping all the bits to the left of the first occurrence of 1 → 011 1100B = 60D
Hence, the value is -60D

3.9 Big Endian vs. Little Endian

Modern computers store one byte of data in each memory address or location, i.e., byte-addressable memory. A 32-bit integer is, therefore, stored in 4 memory addresses.

The term "Endian" refers to the order of storing bytes in computer memory. In the "Big Endian" scheme, the most significant byte is stored first in the lowest memory address (or big in first), while "Little Endian" stores the least significant byte in the lowest memory address.

For example, the 32-bit integer 12345678H (305419896D) is stored as 12H 34H 56H 78H in big endian; and 78H 56H 34H 12H in little endian. A 16-bit integer 00H 01H is interpreted as 0001H in big endian, and 0100H in little endian.

3.10 Exercise (Integer Representation)
  1. What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integers, in "unsigned" and "signed" representation?
  2. Give the value of 88, 0, 1, 127, and 255 in 8-bit unsigned representation.
  3. Give the value of +88, -88, -1, 0, +1, -128, and +127 in 8-bit 2's complement signed representation.
  4. Give the value of +88, -88, -1, 0, +1, -127, and +127 in 8-bit sign-magnitude representation.
  5. Give the value of +88, -88, -1, 0, +1, -127 and +127 in 8-bit 1's complement representation.
  6. [TODO] more.
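Answers to the 8-bit representation exercises can be cross-checked with a short Java sketch that encodes a value in each of the three signed schemes. The class and method names are hypothetical, and for zero the sketch prints only the all-zero pattern even though sign-magnitude and 1's complement each have a second one.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class SignedEncodings {
    // Formats the low 8 bits of a pattern as "bbbb bbbb".
    private static String bits8(int pattern) {
        String s = Integer.toBinaryString((pattern & 0xFF) | 0x100).substring(1);
        return s.substring(0, 4) + " " + s.substring(4);
    }

    private static void check(int v, int min) {
        if (v < min || v > 127) {
            throw new IllegalArgumentException("out of 8-bit range: " + v);
        }
    }

    // 2's complement, range -128..127. Java ints already use 2's complement,
    // so keeping the low 8 bits is enough.
    public static String twosComplement(int v) {
        check(v, -128);
        return bits8(v);
    }

    // Sign-magnitude, range -127..127: sign bit followed by the 7-bit magnitude.
    public static String signMagnitude(int v) {
        check(v, -127);
        return bits8(Math.abs(v) | (v < 0 ? 0x80 : 0));
    }

    // 1's complement, range -127..127: a negative value inverts every bit
    // of the corresponding positive pattern.
    public static String onesComplement(int v) {
        check(v, -127);
        return bits8(v >= 0 ? v : ~(-v));
    }

    public static void main(String[] args) {
        System.out.println(twosComplement(-88)); // 1010 1000
        System.out.println(signMagnitude(-88));  // 1101 1000
        System.out.println(onesComplement(-88)); // 1010 0111
    }
}
```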

Answers

  1. The range of unsigned n-bit integers is [0, 2^n - 1]. The range of n-bit 2's complement signed integers is [-2^(n-1), +2^(n-1)-1].
  2. 88 (0101 1000), 0 (0000 0000), 1 (0000 0001), 127 (0111 1111), 255 (1111 1111).
  3. +88 (0101 1000), -88 (1010 1000), -1 (1111 1111), 0 (0000 0000), +1 (0000 0001), -128 (1000 0000), +127 (0111 1111).
  4. +88 (0101 1000), -88 (1101 1000), -1 (1000 0001), 0 (0000 0000 or 1000 0000), +1 (0000 0001), -127 (1111 1111), +127 (0111 1111).
  5. +88 (0101 1000), -88 (1010 0111), -1 (1111 1110), 0 (0000 0000 or 1111 1111), +1 (0000 0001), -127 (1000 0000), +127 (0111 1111).

4. Floating-Point Number Representation

A floating-point number (or real number) can represent a very large (1.23×10^88) or a very small (1.23×10^-88) value. It could also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as illustrated:

A floating-point number is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form of F×r^E. Decimal numbers use radix of 10 (F×10^E); while binary numbers use radix of 2 (F×2^E).

Representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2; binary number 1010.1011B can be normalized as 1.0101011B×2^3.

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are an infinite number of real numbers (even within a small range of, say, 0.0 to 0.1). On the other hand, an n-bit binary pattern can represent only a finite 2^n distinct numbers. Hence, not all real numbers can be represented. The nearest approximation will be used instead, resulting in loss of accuracy.

It is also important to note that floating-point arithmetic is very much less efficient than integer arithmetic. It could be sped up with a so-called dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.

In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

4.1 IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation:

The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 8 bits represent the exponent (E).
The remaining 23 bits represent the fraction (F).
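The three fields of a 32-bit pattern can be pulled apart in Java with `Float.floatToIntBits` and a few shifts and masks. The sketch below is only an illustration; the class name `FloatFields` and its method names are assumptions.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class FloatFields {
    // Sign bit S: bit 31.
    public static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Stored (biased) exponent E: bits 30..23.
    public static int exponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Fraction F: bits 22..0.
    public static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = -5.5f;
        // Prints: S=1 E=129 F=300000 (F in hex)
        System.out.printf("S=%d E=%d F=%06X%n", sign(f), exponent(f), fraction(f));
        // Loss of precision: 0.1, 0.2 and 0.3 are not exactly representable in binary.
        System.out.println(0.1 + 0.2 == 0.3); // false
    }
}
```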

Normalized Form

Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:

S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000

In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D.

The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative numbers. In this example with S=1, this is a negative number, i.e., -1.375D.

In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponents. In this example, E = 1000 0001B = 129D, so the actual exponent is 129-127 = 2D, and the number represented is -1.375×2^2 = -5.5D.

4.2 Exercises (Floating-point Numbers)

  1. Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.
  2. Compute the largest and smallest negative numbers that can be represented in the 32-bit normalized form.
  3. Repeat (1) for the 32-bit denormalized form.
  4. Repeat (2) for the 32-bit denormalized form.

Hints:

  1. Largest positive number: S=0, E=1111 1110 (254), F=111 1111 1111 1111 1111 1111. Smallest positive number: S=0, E=0000 0001 (1), F=000 0000 0000 0000 0000 0000.
  2. Same as above, but S=1.
  3. Largest positive number: S=0, E=0, F=111 1111 1111 1111 1111 1111. Smallest positive number: S=0, E=0, F=000 0000 0000 0000 0000 0001.
  4. Same as above, but S=1.

Notes For Java Users

You can use JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or double-precision 64-bit double with specific bit patterns, and print their values. For example,

    System.out.println(Float.intBitsToFloat(0x7fffff));
    System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));

4.3 IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:

The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 11 bits represent the exponent (E).
The remaining 52 bits represent the fraction (F).

The value (N) is calculated as follows:

Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022).
For E = 2047, N represents special values, such as ±INF (infinity), NaN (not a number).

4.4 More on Floating-Point Representation

There are three parts in the floating-point representation:

The sign bit (S) is self-explanatory (0 for positive numbers and 1 for negative numbers).
For the exponent (E), a so-called bias (or excess) is applied so as to represent both positive and negative exponents. The bias is set at half of the range. For single precision with an 8-bit exponent, the bias is 127 (or excess-127). For double precision with an 11-bit exponent, the bias is 1023 (or excess-1023).
The fraction (F) (also called the mantissa or significand) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1; while the leading bit for denormalized numbers is 0.
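The double-precision formulas can be evaluated by hand in Java and compared against `Double.longBitsToDouble`. This is a sketch under the assumptions stated in the comments; the class name `Ieee754Double` and method `value` are invented here. `Math.scalb(d, k)` multiplies `d` by 2^k.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class Ieee754Double {
    // Evaluates a 64-bit pattern with the normalized/denormalized formulas.
    public static double value(long bits) {
        long s = bits >>> 63;                    // sign bit S
        int e = (int) ((bits >>> 52) & 0x7FF);   // 11-bit stored exponent E
        long f = bits & 0xFFFFFFFFFFFFFL;        // 52-bit fraction F
        double sign = (s == 0) ? 1.0 : -1.0;     // (-1)^S
        double frac = f / (double) (1L << 52);   // 0.F as a number in [0, 1)
        if (e == 0) {
            return sign * Math.scalb(frac, -1022);           // denormalized form (and zero)
        }
        if (e == 2047) {
            // Special values: infinity when F is zero, otherwise NaN.
            return (f == 0) ? sign * Double.POSITIVE_INFINITY : Double.NaN;
        }
        return sign * Math.scalb(1.0 + frac, e - 1023);      // normalized form
    }

    public static void main(String[] args) {
        System.out.println(value(Double.doubleToLongBits(-5.5))); // -5.5
        System.out.println(value(0x3FF0000000000000L));           // 1.0
        System.out.println(value(1L) == Double.MIN_VALUE);        // true
    }
}
```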

Normalized Floating-Point Numbers

In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^-23D, 1.001011B×2^11B. For binary numbers, the leading bit is always 1, and need not be represented explicitly; this saves 1 bit of storage.

In IEEE 754's normalized form:

For single-precision, 1 ≤ E ≤ 254 with excess of 127. Hence, the actual exponent is from -126 to +127. Negative exponents are used to represent small numbers (< 1.0); while positive exponents are used to represent large numbers (> 1.0).
N = (-1)^S × 1.F × 2^(E-127)
For double-precision, 1 ≤ E ≤ 2046 with excess of 1023. The actual exponent is from -1022 to +1023, and
N = (-1)^S × 1.F × 2^(E-1023)

Take note that an n-bit pattern has a finite number of combinations (=2^n), which can represent only finitely many distinct numbers. It is not possible to represent the infinite numbers on the real axis (even a small range, say 0.0 to 1.0, has infinitely many numbers). That is, not all floating-point numbers can be accurately represented. Instead, the closest approximation is used, which leads to loss of accuracy.

The minimum and maximum normalized floating-point numbers are:

Single, N(min): 0080 0000H = 0 00000001 00000000000000000000000B (E = 1, F = 0)
  N(min) = 1.0B × 2^-126 (≈ 1.17549435 × 10^-38)
Single, N(max): 7F7F FFFFH = 0 11111110 11111111111111111111111B (E = 254, F = all 1's)
  N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 (≈ 3.4028235 × 10^38)
Double, N(min): 0010 0000 0000 0000H
  N(min) = 1.0B × 2^-1022 (≈ 2.2250738585072014 × 10^-308)
Double, N(max): 7FEF FFFF FFFF FFFFH
  N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 (≈ 1.7976931348623157 × 10^308)
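These four bit patterns can be checked against the constants that the JDK defines for the same limits. The class name `FloatLimits` and its field names are assumptions of this sketch.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class FloatLimits {
    // Values decoded from the hex patterns quoted in the text.
    public static final float SINGLE_MIN = Float.intBitsToFloat(0x00800000);
    public static final float SINGLE_MAX = Float.intBitsToFloat(0x7F7FFFFF);
    public static final double DOUBLE_MIN = Double.longBitsToDouble(0x0010000000000000L);
    public static final double DOUBLE_MAX = Double.longBitsToDouble(0x7FEFFFFFFFFFFFFFL);

    public static void main(String[] args) {
        // Each comparison checks a decoded pattern against the JDK constant.
        System.out.println(SINGLE_MIN == Float.MIN_NORMAL);  // true
        System.out.println(SINGLE_MAX == Float.MAX_VALUE);   // true
        System.out.println(DOUBLE_MIN == Double.MIN_NORMAL); // true
        System.out.println(DOUBLE_MAX == Double.MAX_VALUE);  // true
        System.out.println(SINGLE_MAX);                      // 3.4028235E38
        System.out.println(DOUBLE_MAX);                      // 1.7976931348623157E308
    }
}
```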

Denormalized Floating-Point Numbers

If E = 0, but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:

For single-precision, E = 0, N = (-1)^S × 0.F × 2^(-126)
For double-precision, E = 0, N = (-1)^S × 0.F × 2^(-1022)

Denormalized form can represent very small numbers close to zero, and zero, which cannot be represented in normalized form, as shown in the above figure.

The minimum and maximum of denormalized floating-point numbers are:

Precision Denormalized D(min) Denormalized D(max)

5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~

Code number 32D (20H) is the blank or space character.
'0' to '9': 30H-39H (0011 0000B to 0011 1001B) or (0011 xxxxB where xxxx is the equivalent integer value).
'A' to 'Z': 41H-5AH (0100 0001B to 0101 1010B) or (010x xxxxB). 'A' to 'Z' are continuous without gap.
'a' to 'z': 61H-7AH (0110 0001B to 0111 1010B) or (011x xxxxB). 'a' to 'z' are also continuous without gap. However, there is a gap between uppercase and lowercase letters. To convert between upper and lowercase, flip the value of bit-5.

Code numbers 0D (00H) to 31D (1FH), and 127D (7FH) are special control characters, which are non-printable (non-displayable), as tabulated below. Many of these characters were used in the early days for transmission control (e.g., STX, ETX) and printer control (e.g., Form-Feed), which are now obsolete. The remaining meaningful codes today are:

09H for Tab ('\t').
0AH for Line-Feed or newline (LF, '\n') and 0DH for Carriage-Return (CR, '\r'), which are used as line delimiter (aka line separator, end-of-line) for text files. There is unfortunately no standard for line delimiter: Unixes and Mac use 0AH ("\n"), Windows use 0D0AH ("\r\n"). Programming languages such as C/C++/Java (which were created on Unix) use 0AH ("\n").

In programming languages such as C/C++/Java, line-feed (0AH) is denoted as '\n', carriage-return (0DH) as '\r', tab (09H) as '\t'.
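The bit patterns noted above for digits and letters can be observed directly in Java, where a `char` holding an ASCII character has the same code number. The class name `AsciiPatterns` and its methods are hypothetical names chosen for this sketch.

```java
// Hypothetical helper class for illustration; not part of the original tutorial.
class AsciiPatterns {
    // Numeric value of an ASCII digit: the low 4 bits of 0011 xxxxB.
    public static int digitValue(char c) {
        if (c < '0' || c > '9') {
            throw new IllegalArgumentException("Not an ASCII digit: " + c);
        }
        return c & 0x0F;
    }

    // Switches an ASCII letter between upper and lower case by flipping bit 5 (20H).
    public static char flipCase(char letter) {
        boolean isLetter = (letter >= 'A' && letter <= 'Z') || (letter >= 'a' && letter <= 'z');
        if (!isLetter) {
            throw new IllegalArgumentException("Not an ASCII letter: " + letter);
        }
        return (char) (letter ^ 0x20);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');                // 65
        System.out.println(Integer.toHexString('a')); // 61
        System.out.println(digitValue('7'));          // 7
        System.out.println(flipCase('A'));            // a
        System.out.println(flipCase('z'));            // Z
        System.out.println((int) '\n');               // 10
    }
}
```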

DEC  HEX  Meaning                      DEC  HEX  Meaning
0    00   NUL  Null                    17   11   DC1  Device Control 1
1    01   SOH  Start of Heading        18   12   DC2  Device Control 2
2    02   STX  Start of Text           19   13   DC3  Device Control 3
3    03   ETX  End of Text             20   14   DC4  Device Control 4
4    04   EOT  End of Transmission     21   15   NAK  Negative Ack.
5    05   ENQ  Enquiry                 22   16   SYN  Sync. Idle
6    06   ACK  Acknowledgment          23   17   ETB  End of Transmission Block
7    07   BEL  Bell                    24   18   CAN  Cancel
8    08   BS   Back Space '\b'         25   19   EM   End of Medium
9    09   HT   Horizontal Tab '\t'     26   1A   SUB  Substitute
10   0A   LF   Line Feed '\n'          27   1B   ESC  Escape
11   0B   VT   Vertical Tab            28   1C   IS4  File Separator
12   0C   FF   Form Feed '\f'          29   1D   IS3  Group Separator
13   0D   CR   Carriage Return '\r'    30   1E   IS2  Record Separator
14   0E   SO   Shift Out               31   1F   IS1  Unit Separator
15   0F   SI   Shift In
16   10   DLE  Datalink Escape         127  7F   DEL  Delete

5.2 8-bit Latin-1 (aka ISO/IEC 8859-1)

ISO/IEC-8859 is a collection of 8-bit character encoding standards for the western languages.

ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for Western European languages. It has 191 printable characters from the Latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:

Hex  0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
A    NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
B    °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
C    À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
D    Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
E    à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
F    ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
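A few entries of this table can be checked with Python's built-in "latin-1" codec, which implements ISO-8859-1. This is only an illustrative sketch; the helper name latin1_char is an assumption made for this example.

```python
def latin1_char(code: int) -> str:
    """Decode a single code number (0-255) as a Latin-1 character."""
    return bytes([code]).decode("latin-1")


# Row E column 9, row D column F, and row A column 3 of the table
print(latin1_char(0xE9), latin1_char(0xDF), latin1_char(0xA3))  # é ß £

# The first 128 code numbers coincide with US-ASCII,
# so plain ASCII text produces identical bytes under both encodings.
print("Hello".encode("latin-1") == "Hello".encode("ascii"))  # True

# Every Latin-1 character occupies exactly one byte.
print(len("\u00e9t\u00e9".encode("latin-1")))  # 3
```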

ISO/IEC-8859 has parts numbered 1 to 16. Besides the most commonly-used Part 1, the others are:
Part 2 for Central European (Polish, Czech, Hungarian, etc).
Part 3 for South European (Turkish, etc).
Part 4 for North European (Estonian, Latvian, etc).
Part 5 for Cyrillic.
Part 6 for Arabic.
Part 7 for Greek.
Part 8 for Hebrew.
Part 9 for Turkish.
Part 10 for Nordic.
Part 11 for Thai.
Part 12 was abandoned.
Part 13 for Baltic Rim.
Part 14 for Celtic.
Part 15 for French, Finnish, etc.
Part 16 for South-Eastern European.

5.3 Other 8-bit Extensions of US-ASCII (ASCII Extensions)

Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.

ANSI (American National Standards Institute) (aka Windows-1252, or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1 with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or some strange symbols. This happens because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling.

Hex  0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
8    €         ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
9         ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ

Code numbers 81H, 8DH, 8FH, 90H and 9DH are unassigned in Windows-1252 and are left blank above.
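The mislabeling problem described above can be reproduced in a few lines. This sketch uses Python's built-in "cp1252" and "latin-1" codecs; the sample bytes are made up for illustration.

```python
# Bytes as a Windows-1252 editor would store: “Hello” – it’s
raw = b"\x93Hello\x94 \x96 it\x92s"

as_cp1252 = raw.decode("cp1252")   # correct label
as_latin1 = raw.decode("latin-1")  # wrong label: 80H-9FH are not printable here

print(as_cp1252)        # “Hello” – it’s
print(repr(as_latin1))  # '\x93Hello\x94 \x96 it\x92s'
```

Under the ISO-8859-1 label the quotes and the dash turn into invisible control codes, which a display typically renders as boxes or question marks.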

EBCDIC (Extended Binary Coded Decimal Interchange Code): used in the early IBM computers.

5.4 Unicode (aka ISO/IEC 10646 Universal Character Set)

Before Unicode, no single character encoding scheme could represent characters in all languages. For example, Western European languages use several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes conflict with each other, i.e., the same code number is assigned to different characters. Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. The Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org). Unicode is an ISO/IEC

can be identified and decoded easily.

Example: 您好 (Unicode: 60A8H 597DH)
Unicode (UCS-2) is 60A8H = 0110 000010 101000B
UTF-8 is 11100110 10000010 10101000B = E6 82 A8H
Unicode (UCS-2) is 597DH = 0101 100101 111101B
UTF-8 is 11100101 10100101 10111101B = E5 A5 BDH

5.6 UTF-16 (Unicode Transformation Format 16-bit)

UTF-16 is a variable-length Unicode character encoding scheme, which uses either 2 or 4 bytes for each character. UTF-16 is not commonly used for storing or exchanging text files. The transformation table is as follows:

Unicode                                UTF-16 Code                                               Bytes
xxxxxxxx xxxxxxxx                      Same as UCS-2 - no encoding                               2
000uuuuu zzzzyyyy yyxxxxxx (uuuuu≠0)   110110ww wwzzzzyy 110111yy yyxxxxxx (wwww = uuuuu - 1)    4
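The second row of the table can be turned into code almost literally. The sketch below is an illustration only (the function name utf16_code_units is an assumption, not a library call); it returns the 16-bit code units for a given Unicode number.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if not (0 <= code_point <= 0x10FFFF) or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode number")
    if code_point < 0x10000:
        return [code_point]                      # BMP: same as UCS-2
    uuuuu = code_point >> 16                     # top 5 bits (non-zero here)
    wwww = uuuuu - 1
    low16 = code_point & 0xFFFF                  # zzzzyyyy yyxxxxxx
    high = 0xD800 | (wwww << 6) | (low16 >> 10)  # 110110ww wwzzzzyy
    low = 0xDC00 | (low16 & 0x3FF)               # 110111yy yyxxxxxx
    return [high, low]


print([hex(u) for u in utf16_code_units(0x60A8)])   # ['0x60a8']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The result for 1F600H can be cross-checked against Python's own codec: chr(0x1F600).encode("utf-16-be") gives the bytes D8 3D DE 00.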

Take note that for the 65536 characters in BMP, UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP: each supplementary character requires a pair of 16-bit values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

5.7 UTF-32 (Unicode Transformation Format 32-bit)

Same as UCS-4, which uses 4 bytes for each character - unencoded.

5.8 Formats of Multi-Byte (e.g., Unicode) Text Files

Endianness (or byte-order): For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian, the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian, the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number of 60A8H) is stored as 60 A8 in big endian; and stored as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly used, and is often the default.
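The two byte orders in this example can be observed directly. This short sketch relies on Python's built-in "utf-16-be" and "utf-16-le" codecs and on int.to_bytes; the helper name stored_bytes is an assumption for illustration.

```python
def stored_bytes(code: int, byte_order: str) -> str:
    """Show how a 16-bit code number is laid out in memory ('big' or 'little')."""
    return code.to_bytes(2, byte_order).hex(" ").upper()


print(stored_bytes(0x60A8, "big"))     # 60 A8
print(stored_bytes(0x60A8, "little"))  # A8 60

# The same layouts come out of the text codecs.
ch = "\u60a8"
print(ch.encode("utf-16-be").hex(" ").upper())  # 60 A8
print(ch.encode("utf-16-le").hex(" ").upper())  # A8 60
```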

BOM (Byte Order Mark): BOM is a special Unicode character having code number of FEFFH, which is used to differentiate big-endian and little-endian. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH. Unicode reserves these two code numbers to prevent it from clashing with another character.

Unicode text files could take on these formats:
Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.
Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.
UTF-16 with BOM: the first character of the file is a BOM character, which specifies the endianness (FE FFH for big-endian, FF FEH for little-endian).

A UTF-8 file has a fixed byte order, so the BOM plays no part in determining endianness. However, in some systems (in particular Windows), a BOM is added as the first character in the UTF-8 file as the signature to identify the file as UTF-8 encoded. The BOM character (FEFFH) is encoded in UTF-8 as EF BB BF. Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted in other systems. You can have a UTF-8 file without BOM.

5.9 Formats of Text Files

Line Delimiter or End-Of-Line (EOL): Sometimes, when you use the Windows NotePad to open a text file (created in Unix or Mac), all the lines are joined together. This is because different operating platforms use different characters as the so-called line delimiter (or end-of-line or EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).
Windows/DOS uses 0D0AH (CR+LF, "\r\n") as EOL.
Unixes (including Mac OS X) use 0AH (LF, "\n") only.
The classic Mac OS (up to version 9) uses 0DH (CR, "\r") only.

End-of-File (EOF): [TODO]
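A program that reads text files from several platforms has to cope with both the optional BOM and the differing line delimiters. The following is a hedged sketch, not a complete detector: the function name read_text_bytes is hypothetical, UTF-32 is deliberately not handled, and files without a BOM are simply assumed to be UTF-8.

```python
import codecs


def read_text_bytes(data: bytes) -> str:
    """Decode raw file bytes using the BOM (if any) and normalize EOLs to LF."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        text = data[len(codecs.BOM_UTF8):].decode("utf-8")
    elif data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        text = data[2:].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        text = data[2:].decode("utf-16-le")
    else:
        text = data.decode("utf-8")             # assumption: no BOM means UTF-8
    # Replace CR+LF first, so that a lone CR left over is a classic Mac EOL.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(read_text_bytes(b"\xef\xbb\xbfline1\r\nline2")))  # 'line1\nline2'
print(repr(read_text_bytes(b"\xfe\xff\x00a\x00\r\x00b")))    # 'a\nb'
```

The order of the two replace calls matters: converting lone CRs first would turn each CR+LF into two line-feeds.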

5.10 Windows' CMD Codepage

Character encoding scheme (charset) in Windows is called codepage. In CMD shell, you can issue command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.

Take note that:
The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII, which is different from Latin-1 for code numbers above 127.
Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem in browsers, where quotes and apostrophes are displayed as question marks or boxes, arises because the page is supposed to be Windows-1252, but is mislabelled as ISO-8859-1.
For internationalization and Chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for Chinese characters in GB2312, codepage 950 for Chinese characters in Big5.

5.11 Chinese Character Sets

Unicode supports all languages, including Asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which unfortunately requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16). Worse still, there are also various Chinese character sets, which are not compatible with Unicode:

GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese character. The most significant bit (MSB) of both bytes is set to 1 to co-exist with 7-bit ASCII, which has an MSB of 0. There are about 6700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional Chinese characters.
BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character. The MSB of the first byte is set to 1; the second byte can have an MSB of either 0 or 1 (for example, the second byte 4DH of the code A94DH listed below). BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.

For example, the world is made more interesting with these many standards:

             Standard   Characters   Codes
Simplified   GB2312     和谐         BACD D0B3
             UCS-2      和谐         548C 8C10
             UTF-8      和谐         E5928C E8B090
Traditional  BIG5       和諧         A94D BFD3
             UCS-2      和諧         548C 8AE7
             UTF-8      和諧         E5928C E8ABA7
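The codes in this table can be regenerated with Python's built-in codecs. This is an illustrative sketch only: the helper name hex_codes is an assumption, and "utf-16-be" stands in for UCS-2, which is equivalent for these BMP characters.

```python
def hex_codes(text: str, encoding: str) -> str:
    """Encode each character separately and return its bytes as hex."""
    return " ".join(ch.encode(encoding).hex().upper() for ch in text)


simplified = "\u548c\u8c10"   # Unicode numbers 548CH 8C10H
traditional = "\u548c\u8ae7"  # Unicode numbers 548CH 8AE7H

for label, codec, text in [
    ("GB2312", "gb2312", simplified),
    ("UCS-2", "utf-16-be", simplified),
    ("UTF-8", "utf-8", simplified),
    ("BIG5", "big5", traditional),
    ("UCS-2", "utf-16-be", traditional),
    ("UTF-8", "utf-8", traditional),
]:
    print(label, text, hex_codes(text, codec))
```

Each CJK character takes 2 bytes in GB2312, BIG5 and UCS-2, but 3 bytes in UTF-8, which matches the size penalty mentioned above.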

Notes for Windows' CMD Users: To display Chinese characters correctly in the CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use command "chcp" to display the current codepage and command "chcp codepage_number" to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT Raster font).

5.12 Collating Sequences (for Ranking Characters)

A string consists of a sequence of characters in upper or lower cases, e.g., "apple", "BOY", "Cat". In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the