Understanding RISC Architecture and Memory Organization in Computer Systems - Prof. Edward, Study notes of Computer Architecture and Organization

These notes give an overview of the RISC (Reduced Instruction Set Computer) architecture and its memory organization. RISC focuses on reducing the number and complexity of instructions, allowing faster execution and more efficient use of memory. The notes cover the role of the memory buffer register (MBR, also called MDR) in passing data to and retrieving data from memory, as well as the advantages of RISC over CISC (Complex Instruction Set Computer) in terms of instruction decoding time and memory bandwidth. They also touch upon the importance of registers and cache memory in RISC systems.

Typology: Study notes

Pre 2010

Uploaded on 08/04/2009

koofers-user-aew
Chapter 2 Computer Systems Organization
This chapter contains notes to accompany the second chapter of the textbook Structured
Computer Organization (Fifth Edition) by Andrew S. Tanenbaum.
Processors: CPU Organization
A typical computer has four main components. These are:
1. The CPU (Central Processing Unit),
2. The Main Memory (also called “Core Memory” for historical reasons),
3. A set of I/O Devices, such as a NIC (Network Interface Card), CD–ROM,
Disk, Keyboard, and Video Display, and
4. At least one system bus to allow the other components to communicate.
The CPU, in turn, comprises a number of important subsystems.
1. The ALU (Arithmetic Logic Unit),
2. The Control Unit,
3. A set of registers, including user (general purpose) and special purpose registers,
4. An internal set of CPU buses to allow these components to communicate.
One should note that all modern operating systems and hardware architectures support virtual
memory, which is an arrangement in which a large disk drive serves as a secondary memory
device. It is for this reason that a disk drive is often considered a memory device. As a disk
drive is accessed via mechanisms associated with the I/O system, it can also be called an I/O
device. We shall preserve this ambiguity through the course and refer to disk drives within
whichever context appears to be more convenient at the time of writing.
One should also note that the term “bus” normally refers to the main system bus used to
connect the top–level components of the computer.
The following figure presents another representation of the top–level organization of a
typical “von Neumann style” stored program computer.
Page 1 of 28 pages CPSC 5155 Revised August 28, 2006

CPU: Special–Purpose Registers
The CPU has a number of special–purpose registers that are of interest at this time.

The Program Counter (PC) contains the address of the instruction to be executed next. As a part of the fetch cycle (see the next section), the instruction at that address is fetched and the PC is incremented to point to the next instruction. Any branch involves updating the contents of the PC with the new target address. It is worth noting that the PC is better named by the nonstandard term “IP” or “Instruction Pointer” used in Intel documentation. As the textbook notes, the PC does not count anything; the origin of the terminology is quite obscure.

The MAR (Memory Address Register) is used to address RAM (Random Access Memory).

The MBR (Memory Buffer Register), also called the MDR (Memory Data Register), is used to pass data to memory and retrieve data from memory. It is worth noting that the CPU design discussed in Chapter 4 of the textbook calls for an MBR and a separate MDR, but that such usage is non–standard.

The IR (Instruction Register) is the register that holds the binary machine code for the instruction currently under execution.

We shall discuss the interaction of these registers in the next section, in which we discuss the Fetch–Execute cycle.

The structure of a typical CPU is shown in the next figure, taken from Tanenbaum’s textbook. This is also called the “data path”, referring to the fact that it shows the flow of data from one or two source registers into the ALU and then back to the destination. The symbol used for the ALU is a standard one, reflecting the fact that the ALU will have two inputs to accommodate many binary arithmetic operations (such as addition) and one output to produce the results. We shall later see that most modern CPUs internally have a three-bus structure.

Often a CPU will have one of its general purpose registers, usually called “Register 0”, set to the constant value 0. This greatly facilitates the design of the control unit.
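
The way the PC, MAR, MBR, and IR cooperate during a fetch can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the memory is word-addressed, the register set is a plain dictionary, and the three memory words are made-up values rather than real machine code.

```python
# Toy model of the fetch step: PC -> MAR, memory -> MBR -> IR, then PC + 1.
# The word-addressed memory and its contents are illustrative assumptions.

def fetch(cpu, memory):
    """Perform one instruction fetch and return the updated register set."""
    cpu = dict(cpu)                   # work on a copy of the registers
    cpu["MAR"] = cpu["PC"]            # address of the next instruction
    cpu["MBR"] = memory[cpu["MAR"]]   # memory delivers the word to the MBR
    cpu["IR"] = cpu["MBR"]            # the instruction moves to the IR
    cpu["PC"] = cpu["PC"] + 1         # PC now points to the following word
    return cpu

memory = [0x1A, 0x2B, 0x3C]           # hypothetical machine words
cpu = {"PC": 0, "MAR": 0, "MBR": 0, "IR": 0}
cpu = fetch(cpu, memory)
```

A branch would simply overwrite `cpu["PC"]` with the target address before the next call to `fetch`.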

RISC vs. CISC Computers
One of the recent developments in computer architecture is called by the acronym RISC. Under this classification, a design is either RISC or CISC, with the following definitions.
  RISC   Reduced Instruction Set Computer
  CISC   Complex Instruction Set Computer
The definition of CISC architecture is very simple – it is any design that does not implement RISC architecture. We now define RISC architecture and give some history of its evolution.

The source for these notes is the book Computer Systems Design and Architecture, by Vincent P. Heuring and Harry F. Jordan.

One should note that while the name “RISC” is of fairly recent origin (dating to the late 1970’s) the concept can be traced to the work of Seymour Cray, then of Control Data Corporation, on the CDC–6400 and related machines. Mr. Cray did not think in terms of a reduced instruction set, but in terms of a very fast computer with a well-defined purpose – to solve complex mathematical simulations. The resulting design supported only two basic data types (integers and real numbers) and had a very simple, but powerful, instruction set. Looking back at the design of this computer, we see that the CDC–6400 could have been called a RISC design.

As we shall see just below, the entire RISC vs. CISC evolution is driven by the desire to obtain maximum performance from a computer at a reasonable price. Mr. Cray’s machines maximized performance by limiting the domain of the problems they would solve.

The general characteristic of CISC architecture is the emphasis on doing more with each instruction. This may involve complex instructions and complex addressing modes; for example, the MC68020 processor supports 25 addressing modes. The ability to do more with each instruction allows more operations to be compressed into the same program size, something very desirable if memory costs are high.

Some historical data will illustrate the memory issue.

Time                       Cost of memory          Cost of disk drive
Introduction of MC6800     $500 for 16 KB RAM      $55,000 for 40 MB
Introduction of MC68000    $200 for 64 KB RAM      $5,000 for 10 MB
Micron (4/10/2002)         $49 for 128 MB RAM      $149 for 20 GB
Dell (8/26/2006)           $214 for 1 GB RAM       $380 for 160 GB
                                                   $650 for 750 GB

Another justification for the CISC architectures was the “semantic gap”, the difference between the structure of the assembly language and the structure of the high level languages (COBOL, C++, Visual Basic, FORTRAN, etc.) that we want the computer to support. It was expected that a more complicated instruction set (more complicated assembly language) would more closely resemble the high level language to be supported and thus facilitate the creation of a compiler targeting that assembly language.

One of the first motivations for the RISC architecture came from a careful study of the implications of the semantic gap. Experimental studies conducted in 1971 by Donald Knuth and 1982 by David Patterson showed that nearly 85% of a program’s statements were simple assignments, conditionals, or procedure calls. None of these required a complicated instruction set. It was further noted that typical compilers translated complex high level language constructs into simpler assembly language statements, not the complicated assembly language instructions that seemed more likely to be used.

The results of this study are quoted from an IEEE Tutorial on RISC architecture [R05]. This table shows the percentages of program statements that fall into five broad classifications.

Language      Pascal      FORTRAN   Pascal   C        SAL
Workload      Scientific  Student   System   System   System
Assignment    74          67        45       38       42
Loop          4           3         5        3        4
Call          1           3         15       12       12
If            20          11        29       43       36
GOTO          2           9         --       3        --
Other                     7         6        1        6

The authors of this study made the following comments on the results.

“There is quite good agreement in the results of this mixture of languages and applications. Assignment statements predominate, suggesting that the simple movement of data is of high importance. There is also a preponderance of conditional statements (If, Loop). These statements are implemented in machine language with some sort of compare and branch instruction. This suggests that the sequence control mechanism is important.”

The “bottom line” for the above results can be summarized as follows.

  1. As time progresses, more and more programs will be written in a compiled high-level language, with far fewer written directly in assembly language.
  2. The compilers for these languages do not make use of the complex instruction sets provided by the architecture in an attempt to close the semantic gap.

In 1979, the author of these notes attended a lecture by a senior design engineer from IBM. He was discussing a feature of an architecture that he designed: he had put about 6 months of highly skilled labor into implementing a particular assembly language instruction and then found that it was used less than 1/10,000 of a percent of the time by any compiler.

So the attempt to close the “semantic gap”, the desire to provide a robust architecture for support of high-level language programming, turned out to lead to a waste of time and resources. Were there any other justifications for the CISC design philosophy?

The narrative from the tutorial continues with remarks on the RISC architectures developed at the University of California at Berkeley. “Although each project [the Berkeley RISC I and RISC II and the IBM 801] had different constraints and goals, the machines they eventually created have a great deal in common.

  1. Operations are register-to-register, with only LOAD and STORE accessing memory.
  2. The operations and addressing modes are reduced. Operations between registers complete in one cycle, permitting a simpler, hardwired control for each RISC, instead of microcode. Multiple-cycle instructions such as floating-point arithmetic are either executed in software or in a special-purpose processor. (Without a coprocessor, RISC’s have mediocre floating-point performance.) Only two simple addressing modes, indexed and PC–relative, are provided. More complicated addressing modes can be synthesized from the simple ones.
  3. Instruction formats are simple and do not cross word boundaries. This restriction allows RISC’s to remove instruction decoding time from the critical execution path. … RISC register operands are always in the same place in the 32-bit word, so register access can take place simultaneously with opcode decoding. This removes the instruction decoding stage from the pipelined execution, making it more effective by shortening the pipeline.”

There are a number of other advantages to the RISC architecture. We list two: better access to memory and better support of compilers.

Better Access to Memory
According to the IEEE Tutorial, “Register-oriented architectures have significantly lower data memory bandwidth. Lower data memory bandwidth is highly desirable since data access is less predictable than instruction access and can cause more performance problems.”

We note that, even at 6.4 GB/second data transfer rates, access to memory is still a bottleneck in modern computer design, so any design that reduces the requirement for memory access (here called reducing the memory bandwidth) would be advantageous.

Better Support of Compilers
According to the IEEE Tutorial, “The load/store nature of these [existing RISC] architectures is very suitable for effective register allocation by the compiler; furthermore, each eliminated memory reference results in saving an entire instruction.”

We note here that more effective register allocation by a compiler will usually result in faster–running code. We see this as another advantage of the RISC design.

Instruction Pre-Fetching
One advantage of the RISC architecture is seen in the process referred to as instruction pre-fetching. In this process, we view the fetch-execute process as a pipeline.

In a traditional fetch-execute machine, the instruction is first fetched from memory and then executed. At least as early as the IBM Stretch (1959), it was recognized that the fetch unit should be doing something during the time interval for executing the instruction. The logical thing for the fetch unit to do was to fetch the instruction in the next memory location on the chance that it would be the instruction that would be executed next. This process has been shown to improve computer performance significantly.

The logic to pre-fetch instructions is facilitated by the RISC design philosophy that all instructions are the same size, so in a machine based on 32-bit words the pre-fetch unit just grabs the next four bytes.

Instruction pre-fetching appears rather simple, except in the presence of program jumps, such as occur in the case of conditional branches and the end of program loops. A lot of work has gone into prediction of the next instruction in such cases, where there are two instructions that could be executed next depending on some condition. It may be possible to execute both candidate instructions and discard the result of the instruction not in the true execution path.

Implications for the Control Unit
The complex instructions in a CISC computer tend to require more support in the execution than can conveniently be provided by a hardwired control unit. For this reason, most CISC computers are microprogrammed to handle the complexity of each of the instructions. As a consequence, most CISC instructions require a number of system clock cycles to execute.

The RISC approach emphasizes use of a simpler instruction set that can easily be supported by a hardwired control unit. As a side effect, most RISC instructions can be executed in one clock cycle.

A given computer program will compile into more RISC instructions than CISC instructions, but the CISC instructions execute more slowly than the RISC instructions. The overall effect on the computer program may be hard to predict.

According to the IEEE tutorial, “Reducing the instruction set further reduces the work a RISC processor has to do. Since RISC has fewer types of instructions than CISC, a RISC instruction requires less processing logic to interpret than a CISC instruction. The effect of such simplification is to speed up the execution rate for RISC instructions. In a RISC implementation it is theoretically possible to execute an instruction each time the computer’s logic clock ticks. In practice the clock rate of a RISC processor is usually three times that of the instruction rate.”
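
The trade-off between instruction count and cycles per instruction can be made concrete with the usual product of instruction count, cycles per instruction, and clock cycle time. The sketch below is illustrative only: the instruction counts and cycle counts are invented numbers chosen to show the shape of the comparison, not measurements of any real machine.

```python
# Execution time = instruction count x cycles per instruction x cycle time.
# All figures below are hypothetical, chosen only to illustrate the trade-off.

def execution_time_ns(instruction_count, cycles_per_instruction, cycle_time_ns):
    """Estimated run time in nanoseconds for a straight-line program."""
    return instruction_count * cycles_per_instruction * cycle_time_ns

# Assumed CISC build: fewer instructions, several clock cycles each.
cisc_time = execution_time_ns(1000, 4, 1.0)

# Assumed RISC build: 50% more instructions, one clock cycle each.
risc_time = execution_time_ns(1500, 1, 1.0)
```

With these assumed figures the RISC version finishes first, but a different mix of instruction counts and cycle counts could reverse the result, which is why the overall effect is hard to predict.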

Digital Equipment Corporation (DEC) undertook an experiment to investigate this correlation and produced a design yielding interesting, but not surprising, results, which are again quoted from the IEEE tutorial on RISC architecture.

“DEC reported a subsetting experiment on two implementations of the VAX architecture in VLSI. The VLSI VAX has nine custom VLSI chips and implements the complete VAX–11 instruction set. DEC found that 20.0 percent of the instructions are responsible for 60.0 percent of the microcode and yet are only 0.2 percent of all instructions executed. By trapping to software to execute these instructions, the MicroVAX 32 was able to fit the subset architecture onto only one chip, with an optional floating–point processor on another chip. … The VLSI VAX uses five to ten times the resources of the MicroVAX 32 to implement the full instruction set, yet is only 20 percent faster.”

               VLSI VAX   MicroVAX 32
VLSI Chips     9          2
Microcode      480K       64K
Transistors    1250K      101K

The result is obvious – a simpler and cheaper computer will do most of what you want. The rest can be better done in software.

Design Principles for Modern Computers
Experience with the RISC designs has led to the acceptance of a number of design practices. These practices reflect the state of the art in a number of key areas. These are:

  1. The fact that most programs are written in high–level compiled languages.
  2. The fact that current CPU clock cycle times (0.25 – 0.50 nanoseconds) are much faster than memory access times (50 nsec for DRAM, 5 nsec for SRAM).
  3. The fact that a simpler instruction set implies a smaller control unit, thus freeing chip area for more registers and on–chip cache.

As the textbook notes, the invention of a significantly faster memory technology, while welcome, would force the development of new computer architectures. Most computer manufacturers would cry all the way to the bank.

As stated in the discussion of the RISC architecture, the presently accepted design principles for the current architectures are as follows.

  1. Implement the microarchitecture level to provide direct hardware execution of the more common instructions with micro–routines for the more complex instructions.
  2. Use pipelining (see notes below) and maximize the rate at which instructions are issued to the execution units.
  3. Minimize the number of instruction formats and make them simpler, so that the instructions are more easily and quickly decoded.
  4. Minimize the number of instructions that reference memory.
  5. Provide plenty of registers and the largest possible on–chip cache memory.
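
The gap between CPU cycle times and memory access times quoted above (0.25 to 0.50 nanoseconds against 50 nsec for DRAM and 5 nsec for SRAM) can be turned into a rough count of clock cycles per memory access. The following sketch simply divides the two figures; the function name is a hypothetical helper, and the model ignores caches, bus overhead, and overlapping accesses.

```python
# Rough number of CPU clock cycles that elapse during one memory access.
# Uses the figures quoted in the notes; the model itself is a simplification.

def cycles_per_access(access_time_ns, cycle_time_ns):
    """Clock cycles that pass while one memory access completes."""
    return access_time_ns / cycle_time_ns

dram_fast_clock = cycles_per_access(50, 0.25)   # DRAM, 0.25 ns clock
dram_slow_clock = cycles_per_access(50, 0.50)   # DRAM, 0.50 ns clock
sram_fast_clock = cycles_per_access(5, 0.25)    # SRAM, 0.25 ns clock
sram_slow_clock = cycles_per_access(5, 0.50)    # SRAM, 0.50 ns clock
```

Under these figures a DRAM access costs on the order of one to two hundred clock cycles, which is one way to see why the principles above favor registers, on–chip cache, and few memory references.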

Instruction–Level Parallelism
Everybody wants a faster computer. It is easy to describe a number of significant problems which would admit useful numerical solutions if much faster computers were available. Here is a small list of such problems.

  1. Weather forecasting, especially prediction of tornado formation and the path and intensity of hurricanes. Accurate prediction of hurricane tracks for the next five days would greatly facilitate the civil defense aspects of preparation.
  2. Realistic simulations of air flow over the bodies and wings of aircraft. We currently have an excellent set of equations for very accurate description of the flow of any fluid (including air), but not the computational resources to solve these equations.
  3. Predicting the folding of proteins. Proteins are complex three–dimensional chemical structures with interaction properties strongly dependent on their structure. Accurate prediction of how these fold into shape would allow better drug development.

There are a number of solutions that have been attempted to increase the performance of modern digital computers. We begin with the most obvious method, one that has served well in the past few decades, but which holds less promise for the future. We then discuss a number of architectures that have been used well in the past and which still hold promise.

The most obvious brute–force way to increase the performance of a computer is to speed up the clock, thus making a faster computer. This strategy has worked in the past, mostly due to the observation commonly called “Moore’s Law”. While the law is stated in terms of the number of transistors on a chip, its speed implication comes from the decrease in chip size that this increased transistor density allows.

The trouble with these “hot” chips is that they are literally too hot; the increased clock speed yields a lot of heat. At present, there are experimental functioning CPU chips that generate too much heat to be handled by any method thought commercially viable. Naturally, these have not been commercialized.

Instruction–Level Parallelism
If we cannot execute a single process more quickly, the next design option is to divide the computation into multiple steps that can be executed in parallel. There are two general ways by which this goal may be achieved: instruction–level parallelism and processor–level parallelism. We now discuss instruction–level parallelism.

The two most common variants of instruction–level parallelism are pipelining and superscalar architecture. The simplest form of pipelining is instruction prefetch, a technique dating back at least as far as the IBM Stretch (1959). In this technique, the fetching of one instruction is done in parallel with the execution of the previous instruction. Put another way, the fetch–execute cycle is broken into separate processes, performed in parallel. The prefetch unit will place the next instruction into a prefetch buffer, which might be implemented as a queue of instructions stored in a set of CPU registers.
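
The benefit of overlapping fetch with execution can be estimated with a simple timing model. The sketch below is a simplification under stated assumptions: every instruction has the same fetch time and the same execute time, there are no branches, and the prefetch buffer never runs empty. The function names and the time units are hypothetical.

```python
# Timing of n instructions with and without instruction prefetch.
# Assumes fixed fetch and execute times and no branches (a simplification).

def sequential_time(n, fetch_time, execute_time):
    """Each instruction is fetched, then executed, with no overlap."""
    return n * (fetch_time + execute_time)

def prefetch_time(n, fetch_time, execute_time):
    """Fetch of instruction i+1 overlaps execution of instruction i."""
    if n == 0:
        return 0
    slower_stage = max(fetch_time, execute_time)
    # The first instruction passes through both stages; each later
    # instruction completes one slower-stage interval after the previous one.
    return fetch_time + execute_time + (n - 1) * slower_stage

without_prefetch = sequential_time(4, 1, 1)
with_prefetch = prefetch_time(4, 1, 1)
```

In this model the slower of the two stages sets the pace once the pipeline is full, which is the “bottleneck” idea that motivates the superscalar designs discussed next.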

It is often the case that the instruction execution unit is the slowest unit, forming the “bottleneck” in the pipeline. The superscalar architecture is designed to address this stage and leave the more efficient stages as single units. We use Tanenbaum’s figure to illustrate.

Processor–Level Parallelism
If one complete processor is good, why not use two or more complete processors? There are a number of arrangements for parallel processing, generally classified as multiprocessor designs and multicomputer designs.

Multiprocessor designs include the dual–core designs found on high–end CPU chips, single–bus multiprocessors, and array computers. As noted above, a dual–core design places two complete CPUs on a single CPU chip, it being easier to run two CPUs at a given speed than one CPU at twice the speed.

Multicomputer designs call for a number of computers complete with individual memory units that communicate by some “message passing” mechanism. This may involve anything from access to a large common memory unit to the use of TCP/IP to send Internet messages.

The primary difficulty in any multiprocessor or multicomputer design is the sharing of data between the CPUs. There are many important problems that can be divided into subproblems with limited interaction. Consider weather modeling, in which the area of the continental U.S. can be divided into areas with interaction only at the boundaries. Although it would not actually be done this way, we could imagine a separate processor for each of the states Mississippi, Alabama, Florida, Georgia, South Carolina, North Carolina, Tennessee, Kentucky, and Virginia. The processor modeling the Georgia weather would communicate directly only with the processors for Alabama, Florida, North and South Carolina, and Tennessee, as these are the states directly bordering Georgia. The effects of weather in Kentucky would be transmitted to the state of Georgia indirectly through the Tennessee processor, just as the Mississippi weather would be mediated by the Alabama processor.

There are several interesting “screen saver” multicomputer applications worth mentioning. In one of these, an individual user connected to the Internet volunteers his computer as a computational server, to be used during idle time in which there is no other use for it. The user downloads a program that is linked to the screen saver and then processes discrete data collections also downloaded from a central site. One of these is the SETI@home project.

Primary Memory
By this we indicate memory that can be directly addressed by the CPU. This is often called “core memory”, due to the fact that it used to be implemented with a technology that used magnetic cores as binary storage devices. It is also called “store” or “storage”. The figure below shows a small core memory unit, probably from the late 1960’s or the 1970’s. Note the individual core units in the enlargement in the left picture.

Beginning with the ENIAC in 1946 (and some earlier machines), computer memory has been organized around binary arithmetic, storing data as binary bits. While it is true that the ENIAC was a decimal machine, performing base–10 arithmetic, each digit was stored in BCD (Binary Coded Decimal) form. In BCD, each digit is coded as four binary bits, with the obvious encodings.

Digit   BCD Code
0       0000
1       0001
2       0010
3       0011
4       0100
5       0101
6       0110
7       0111
8       1000
9       1001

There are a number of reasons for the choice of binary encoding of all data in a computer.

  1. The ease of building reliable two–state devices.
  2. The direct nature of the implementation of binary arithmetic on these devices.
  3. Voltage and noise immunity considerations. In standard TTL circuitry, we assign voltage level 0.0 to 0.8 volts as logic 0 and 2.8 to 5.0 volts as logic 1. With this design, the inevitable noise voltages are easily corrected.
  4. The fact that early devices used relays as primary components; these are essentially binary devices that are “on” or “off”. Vacuum tubes show the same property.
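
The BCD encoding described above, four bits per decimal digit, is easy to reproduce in code. The following is a minimal sketch; the function name `to_bcd` is a hypothetical helper, and it handles only non-negative integers.

```python
# Encode a non-negative integer in BCD: each decimal digit becomes four bits.

def to_bcd(number):
    """Return the BCD code of a non-negative integer, digit groups separated by spaces."""
    if number < 0:
        raise ValueError("this sketch handles only non-negative integers")
    return " ".join(format(int(digit), "04b") for digit in str(number))

year_in_bcd = to_bcd(1946)   # digits 1, 9, 4, 6 encoded one at a time
```

Note that BCD uses only ten of the sixteen possible four-bit patterns; the codes 1010 through 1111 never appear.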

Of course, there is no such thing as a pure Read-Only memory; at some time it must be possible to put data in the memory by writing to it, otherwise there will be no data in the memory to be read. The term “Read-Only” usually refers to the method for access by the CPU. All variants of ROM share the feature that their contents cannot be changed by normal CPU write operations. All variants of RAM (really Read-Write Memory) share the feature that their contents can be changed by normal CPU write operations. Some forms of ROM have their contents set at time of manufacture; others, called PROM (Programmable ROM), can have their contents changed by special devices called PROM Programmers.

Registers associated with the memory system
All memory types, both RAM and ROM, can be characterized by two registers and a number of control signals. Consider a memory of 2^N words, each having M bits. Then
  the MAR (Memory Address Register) is an N-bit register used to specify the memory address, and
  the MBR (Memory Buffer Register) is an M-bit register used to hold data to be written to the memory or just read from the memory. This register is also called the MDR (Memory Data Register).

We specify the control signals to the memory unit by recalling what we need the unit to do. First consider RAM (Read Write Memory). From the viewpoint of the CPU there are three tasks for the memory:
  CPU reads data from the memory. Memory contents are not changed.
  CPU writes data to the memory. Memory contents are updated.
  CPU does not access the memory. Memory contents are not changed.

We need two control signals to specify the three options for a RAM unit. One standard set is
  Select – the memory unit is selected.
  R/W – if 0 the CPU writes to memory, if 1 the CPU reads from memory.

We can use a truth table to specify the actions for a RAM.

Select   R/W   Action
0        0     Memory contents are not changed.
0        1     Memory contents are not changed.
1        0     CPU writes data to the memory.
1        1     CPU reads data from the memory.

Note that when Select = 0, nothing is happening to the memory. It is not being accessed by the CPU and the contents do not change. When Select = 1, the memory is active and something happens.

Consider now a ROM (Read Only Memory). From the viewpoint of the CPU there are only two tasks for the memory:
  CPU reads data from the memory.
  CPU does not access the memory.

We need only one control signal to specify these two options. The natural choice is the Select control signal, as the R/W signal does not make sense if the memory cannot be written by the CPU. The truth table for the ROM should be obvious.

Select   Action
0        CPU is not accessing the memory.
1        CPU reads data from the memory.
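
The RAM truth table can be expressed as a small model in which the MAR, the MBR, and the two control signals drive a single memory cycle. This is a sketch only: the class name `Ram`, the method name `cycle`, and the 16-word size are assumptions made for illustration, and the R/W convention (0 writes, 1 reads) follows the table in these notes.

```python
# Minimal RAM model driven by the Select and R/W control signals.
# Convention from the notes: R/W = 0 means write, R/W = 1 means read.

class Ram:
    def __init__(self, address_bits):
        self.cells = [0] * (2 ** address_bits)  # 2^N words of storage
        self.mar = 0                            # Memory Address Register
        self.mbr = 0                            # Memory Buffer Register

    def cycle(self, select, read_write):
        """Apply one memory cycle according to the truth table."""
        if select == 0:
            return                              # not selected: nothing changes
        if read_write == 1:
            self.mbr = self.cells[self.mar]     # read: memory -> MBR
        else:
            self.cells[self.mar] = self.mbr     # write: MBR -> memory

ram = Ram(address_bits=4)      # hypothetical 16-word memory
ram.mar, ram.mbr = 3, 0x5A
ram.cycle(select=1, read_write=0)   # write 0x5A to address 3
ram.mbr = 0
ram.cycle(select=0, read_write=1)   # not selected: MBR stays 0
mbr_when_unselected = ram.mbr
ram.cycle(select=1, read_write=1)   # read address 3 back into the MBR
```

A ROM model would drop the write branch entirely and keep only the Select test, matching the one-signal truth table.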

The Idea of Address Space
We now must distinguish between the idea of address space and physical memory. The address space defines the range of addresses (indices into the memory array) that can be generated. The size of the physical memory is usually somewhat smaller; this may be by design (see the discussion of memory-mapped I/O below) or just by accident.

An N-bit address will specify 2^N different addresses. In this sense, the address can be viewed as an N-bit unsigned integer, the range of which is 0 to 2^N – 1 inclusive. We can ask another question: given K addressable items, how many address bits are required? The answer is given by the inequality 2^(N-1) < K ≤ 2^N, which is best solved by guessing N.

The memory address is specified by a binary number placed in the Memory Address Register (MAR). The number of bits in the MAR determines the range of addresses that can be generated. N address lines can be used to specify 2^N distinct addresses, numbered 0 through 2^N – 1. This is called the address space of the computer.

For example, we show three MAR sizes.

Computer        MAR bits   Address Range
PDP-11/20       16         0 to 65 535
Intel 8086      20         0 to 1 048 575
Intel Pentium   32         0 to 4 294 967 295

The PDP-11/20 was an elegant small machine made by the now defunct Digital Equipment Corporation. As soon as it was built, people realized that its address range was too small.

In general, the address space is much larger than the physical memory available. For example, my personal computer has an address space of 2^32 (as do all Pentiums), but only 384 MB = 2^28 + 2^27 bytes of physical memory. Until recently the 32-bit address space would have been much larger than any possible amount of physical memory. At present one can go to a number of companies and order a computer with a fully populated address space; i.e., 4 GB of physical memory. Most high-end personal computers are shipped with 1 GB of memory.
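
Instead of guessing N, the number of address bits needed for K addressable items can be computed directly. The sketch below relies on Python’s built-in `int.bit_length`; the function name `address_bits` is a hypothetical helper, and it assumes K is at least 1.

```python
# Smallest N such that 2^(N-1) < K <= 2^N, i.e. enough bits to address K items.

def address_bits(item_count):
    """Number of address bits needed for item_count addressable items."""
    if item_count < 1:
        raise ValueError("need at least one addressable item")
    # Addresses run from 0 to item_count - 1, so count the bits of the largest one.
    return (item_count - 1).bit_length()

bits_for_64k = address_bits(65536)                    # a full 16-bit address space
bits_for_384mb = address_bits(384 * 1024 * 1024)      # 384 MB of bytes
```

The second call illustrates the 384 MB example: the memory is larger than 2^28 bytes, so 29 address bits are needed even though a 32-bit address is available.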
In a design with memory–mapped I/O, part of the address space is dedicated to addressing I/O registers and not physical memory. For example, in the original PDP-11/20, the top 4096 (2^12) addresses of the address space were dedicated to I/O registers, leading to the following memory map.

  Addresses 0 – 61439        Available for physical memory
  Addresses 61440 – 65535    Available for I/O registers (61440 = 65536 – 4096)

We shall return to memory–mapped I/O in a later discussion. For the moment, we need to recall that each I/O device is accessed via a set of registers, including status registers, data registers, and control registers. Each I/O device is assigned a sequence of addresses (either in a unified address space or a dedicated I/O address space) that are used to access these registers and allow the CPU to communicate with the devices.
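
The address decoding implied by such a memory map can be sketched as a simple range test. The example assumes a 16-bit address space whose top 2^12 addresses are reserved for I/O registers, as described for the PDP-11/20; the constant and function names are hypothetical.

```python
# Classify an address in a 16-bit space whose top 2^12 addresses map to I/O.

ADDRESS_SPACE = 2 ** 16                   # 16-bit addresses
IO_REGION_SIZE = 2 ** 12                  # top 4096 addresses reserved for I/O
IO_BASE = ADDRESS_SPACE - IO_REGION_SIZE  # first I/O address (61440)

def classify(address):
    """Return 'memory' or 'io' for a valid 16-bit address."""
    if not 0 <= address < ADDRESS_SPACE:
        raise ValueError("address outside the 16-bit address space")
    return "io" if address >= IO_BASE else "memory"

last_memory_address = IO_BASE - 1
last_io_address = ADDRESS_SPACE - 1
```

Real hardware performs this test by examining the high-order address bits: here an address falls in the I/O region exactly when its top four bits are all 1.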

Big-Endian and Little-Endian

The reference here is to a story in Gulliver’s Travels written by Jonathan Swift in which two groups of men went to war over which end of a boiled egg should be broken: the big end or the little end. The student should be aware that Swift did not write pretty stories for children but focused on biting satire; his work A Modest Proposal is an excellent example.

Consider the 32-bit number represented by the eight-digit hexadecimal number 0x01020304, stored at location Z in memory. In any byte-addressable memory, this number will be stored in the four consecutive addresses Z, (Z + 1), (Z + 2), and (Z + 3). The difference between big-endian and little-endian ordering is where each of the four bytes is stored.

In our example 0x01 represents bits 31 to 24, 0x02 represents bits 23 to 16, 0x03 represents bits 15 to 8, and 0x04 represents bits 7 to 0 of the word. As a 32-bit signed integer, the number 0x01020304 can be represented in decimal notation as

1×16^6 + 0×16^5 + 2×16^4 + 0×16^3 + 3×16^2 + 0×16^1 + 4×16^0 = 16,777,216 + 131,072 + 768 + 4 = 16,909,060.

For those who like to think in bytes, this is (01)×16^6 + (02)×16^4 + (03)×16^2 + 04, arriving at the same result.

Note that the number can be viewed as having a “big end” and a “little end”, as in the following figure. The “big end” contains the most significant digits of the number and the “little end” contains the least significant digits of the number.

We now consider how these bytes are stored in a byte-addressable memory. Recall that each byte, comprising two hexadecimal digits, has a unique address in a byte-addressable memory, and that a 32-bit (four-byte) entry at address Z occupies the bytes at addresses Z, (Z + 1), (Z + 2), and (Z + 3). The hexadecimal values stored in these four byte addresses are shown below.
Address   Big-Endian   Little-Endian
Z         01           04
Z + 1     02           03
Z + 2     03           02
Z + 3     04           01

Just to be complete, consider the 16-bit number represented by the four hex digits 0A0B. Suppose that the 16-bit word is at location W; i.e., its bytes are at locations W and (W + 1). The most significant byte is 0x0A and the least significant byte is 0x0B. The values in the two addresses are shown below.

Address   Big-Endian   Little-Endian
W         0A           0B
W + 1     0B           0A
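The two byte-layout tables can be reproduced with a few lines of Python. This is only a sketch: `byte_layout` is an illustrative helper name, and it relies on the built-in `int.to_bytes` method to produce the bytes in memory order, lowest address first.

```python
def byte_layout(value, size, byteorder):
    """Return the bytes of value in memory order (lowest address first) as hex strings.

    byteorder is "big" or "little", as accepted by int.to_bytes.
    """
    return [f"{b:02X}" for b in value.to_bytes(size, byteorder)]


if __name__ == "__main__":
    for value, size, base in [(0x01020304, 4, "Z"), (0x0A0B, 2, "W")]:
        big = byte_layout(value, size, "big")
        little = byte_layout(value, size, "little")
        print("Address   Big-Endian   Little-Endian")
        for offset in range(size):
            address = base if offset == 0 else f"{base} + {offset}"
            print(f"{address:<9} {big[offset]:<12} {little[offset]}")
```

Note that only the order of whole bytes changes between the two conventions; the two hex digits inside each byte keep their order.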

The figure below shows a graphical way to view these two options for ordering the bytes copied from a register into memory. We suppose a 32-bit register with bits numbered from 31 through 0. Which end is placed first in the memory, at address Z? For big-endian, the “big end” or most significant byte is written first. For little-endian, the “little end” or least significant byte is written first.

There seems to be no advantage of one system over the other. Big-endian seems more natural to most people and facilitates reading hex dumps (listings of a sequence of memory locations), although a good debugger will remove that burden from all but the unlucky.

Big-endian computers include the IBM 360 series, Motorola 68xxx, and SPARC by Sun. Little-endian computers include the Intel Pentium and related computers.

The big-endian vs. little-endian debate is one that does not concern most of us directly. Let the computer handle its bytes in any order desired as long as it produces good results. The only direct impact on most of us will come when trying to port data from one computer to a computer of another type. Transfer over computer networks is facilitated by the fact that the network interfaces for computers will translate to and from the network standard, which is big-endian. The major difficulty will come when trying to read different file types.

The big-endian vs. little-endian debate shows in file structures when computer data are “serialized”, that is, written out a byte at a time. This causes different byte orders for the same data in the same way as the ordering stored in memory. The orientation of the file structure often depends on the machine upon which the software was first developed. The following is a partial list of file types taken from a textbook once used by this author.
Little-endian   Windows BMP, MS Paintbrush, MS RTF, GIF
Big-endian      Adobe Photoshop, JPEG, MacPaint

Some applications support both orientations, with a flag in the header record indicating which is the ordering used in writing the file.

Any student who is interested in the literary antecedents of the terms “big-endian” and “little-endian” may find a quotation at the end of this chapter.
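The serialization issue and the header-flag remedy described above can be sketched in Python with the standard `struct` module. The file layout used here is purely hypothetical and does not correspond to any of the file types listed: one flag byte ("B" for big-endian, "L" for little-endian) followed by a single 32-bit unsigned integer. The names `FORMATS`, `write_record`, and `read_record` are likewise invented for illustration.

```python
import struct

# Hypothetical layout, for illustration only: a one-byte order flag
# followed by one 32-bit unsigned integer written in that order.
FORMATS = {b"B": ">I", b"L": "<I"}   # ">" is big-endian, "<" is little-endian


def write_record(value, flag):
    """Serialize value in the byte order named by flag, prefixed with the flag."""
    return flag + struct.pack(FORMATS[flag], value)


def read_record(data):
    """Read the flag byte, then decode the integer in the order it announces."""
    flag, payload = data[:1], data[1:5]
    if flag not in FORMATS:
        raise ValueError(f"unknown byte-order flag {flag!r}")
    return struct.unpack(FORMATS[flag], payload)[0]


if __name__ == "__main__":
    value = 0x01020304
    little = struct.pack("<I", value)
    # Decoding little-endian bytes as big-endian silently yields a different number.
    print(hex(struct.unpack(">I", little)[0]))
    # "!" selects network byte order, which is big-endian.
    print(struct.pack("!I", value) == struct.pack(">I", value))
    # A reader that honors the flag recovers the value from either ordering.
    for flag in (b"B", b"L"):
        print(flag, hex(read_record(write_record(value, flag))))
```

The point of the flag is that the reader never has to guess: the same `read_record` call works whether the file was written on a big-endian or a little-endian machine.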