Vector Processing: Understanding Data Streams and Vector Instructions in Computer Systems, Study notes of Computer Science

These notes introduce computer taxonomies and then focus on vector processing, a method used in computer systems to operate on multiple data streams concurrently. Under Flynn’s taxonomy, machines are classified by whether they have single or multiple instruction streams and single or multiple data streams; vector processors apply a single instruction stream to multiple data streams. The notes discuss the advantages of vector processing, including improved code density, a reduced number of instructions, and the absence of data dependencies between vector elements, along with its disadvantages. They also cover vector registers, functional units, and the organization of data in vector processors.

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009


Vector Processing

Prof. Loh

CS3220 - Processor Design - Spring 2005

April 5, 2005

1 Computer Taxonomies

There are many possible ways to describe computer systems, and many different taxonomies or categorization systems have been proposed. The most common is due to Flynn, called Flynn’s Taxonomy. This system divides machines up depending on (1) how many instruction streams there are and (2) how many data streams there are. By instruction streams, we mean separate and independent paths of control and state; that is, an instruction stream has its own program counter and its own registers. By data streams, we mean independent pieces of data. A vector of multiple pieces of data may comprise multiple data streams, and independent memory values that are handled concurrently may also comprise separate data streams. Each of the two attributes, Instructions and Data, may be classified as Single or Multiple. This yields four possible combinations:

                  Instruction Streams
Data Streams      Single      Multiple
Single            SISD        MISD
Multiple          SIMD        MIMD

  • Single Instruction Single Data (SISD): A single stream of instructions executes on one stream of data at a time. This includes the traditional CISC CPUs, pipelined CPUs and also superscalar CPUs (because the parallelism is hidden from the ISA).
  • Single Instruction Multiple Data (SIMD): A single stream of instructions may operate on multiple pieces of data in parallel. Typical examples include vector architectures.
  • Multiple Instruction Single Data (MISD): Almost no real examples fall into this category.
  • Multiple Instruction Multiple Data (MIMD): Many independent instruction streams operate on separate sets of data concurrently. Examples include massively parallel processors (“supercomputers”, such as those used for weather modelling), shared memory SMP machines (like the dual processor Xeon boxes you could buy yourself), or the upcoming dual-core processors (two CPUs on one chip).
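The four categories above can be written down as a small lookup. The following is only a sketch: the function name `flynn_class` and the convention of passing stream counts as integers are assumptions made for illustration, not part of Flynn’s definition.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category name for the given stream counts.

    Hypothetical helper: a count of 1 means "Single", anything larger
    means "Multiple".
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


# One program counter operating on a 64-element vector.
vector_machine = flynn_class(1, 64)
# Two independent processors, each working on its own data.
dual_processor = flynn_class(2, 2)
```

Note that a superscalar CPU would still be passed in as `flynn_class(1, 1)`, since its internal parallelism is not visible at the ISA level.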

Note that these are the streams that are exposed to the programmer. Someone writing a program for a MIMD machine must explicitly write more than one thread of execution (instruction stream) and make sure that the different data handled in parallel are handled correctly. In a superscalar processor, multiple instructions may execute concurrently while working on distinct sets of data, but all of this occurs “beneath the covers” and is something that the programmer cannot see.

Why do we bother with taxonomies? The primary interest is that the categorization of systems can help us understand many of the pertinent issues of those systems. Not only in computer systems, but in general, taxonomies assist in evaluating different attributes in an organized manner, which may shed some additional insight on tradeoffs or on the relative importance of different parameters.

Flynn’s taxonomy is perhaps one of the most well known, and therefore commonly used. This is probably due to its simplicity. Unfortunately, it is often a little too coarse, and so other refinements or extensions to Flynn’s taxonomy have been introduced over time. This lecture focuses on SIMD systems, starting with one of the grandfathers, the Cray-1 supercomputer, and then looking at MMX/3DNow!/SSE, which are some of the more modern reincarnations of the SIMD technology that was used back in the 1970s for the Cray.

2 Vector Processors

The pipelined and superscalar processors that we have been covering in this course are all uniprocessors, which are processors that have only a single thread of execution or simply a single program counter. Vector processors are also uniprocessors, and are in many ways similar to more traditional SISD machines. There is a single program counter, and there are typically registers, functional units, and so on. The difference is that in a vector machine, there are typically additional vector registers, vector functional units, and vector instructions to manipulate the vector data.

A vector register may be composed of a large number of smaller scalar registers. Each scalar register holds an independent number, traditionally interpreted as a floating point value (you can interpret it however you want since it’s all just a bunch of bits). Vector computations are more common in scientific computing, where you may be modelling something or performing some mathematical computation (over continuous variables and functions), which is why floating point is typically used.

High throughput (pipelined) vector functional units are used to process data from the vector registers at a high rate. To command the computer to perform these operations, special vector instructions must be used. A typical vector operation would be:

$$\vec{A} = \vec{B} + \vec{C}$$

In a non-vector machine, such code would have to be compiled into a loop that iteratively adds the components of the vectors together in a point-wise fashion. In a vector machine, this is accomplished with a single instruction:

VOP Vdest Varg1 Varg2

Where VOP specifies a vector operation (such as add or multiply), and the V∗ arguments name vector registers. The advantages of vector instructions are:

  • Improved code density (more “work” fits in cache)
  • Reduced number of instructions needed to execute the program
  • Data are organized into regular patterns that can be efficiently handled by the hardware
  • Simple loop constructs can be replaced with a few vector instructions, thus eliminating the loop overhead.
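To make the contrast between the scalar loop and the single vector instruction concrete, here is a rough sketch in Python. The function names `scalar_add_loop` and `vadd` are hypothetical; `vadd` merely stands in for a `VOP`-style vector add and says nothing about how real hardware implements it.

```python
def scalar_add_loop(b, c):
    """Non-vector machine: one add plus loop overhead for every element."""
    a = [0.0] * len(b)
    i = 0
    while i < len(b):        # loop test and branch on every iteration
        a[i] = b[i] + c[i]   # one scalar add per iteration
        i += 1               # index update on every iteration
    return a


def vadd(varg1, varg2):
    """Stand-in for one vector instruction (hypothetical VOP = add).

    A single call covers all elements; there is no per-element loop
    control visible to the "program".
    """
    if len(varg1) != len(varg2):
        raise ValueError("vector operands must have the same length")
    return [x + y for x, y in zip(varg1, varg2)]


B = [1.0, 2.0, 3.0, 4.0]
C = [10.0, 20.0, 30.0, 40.0]
A_loop = scalar_add_loop(B, C)
A_vec = vadd(B, C)
```

Both versions compute the same result; the difference the notes care about is how many instructions the program has to issue to get it.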

Since a single instruction (or a few) can replace the entire body and control overhead of a loop, the number of instructions needed to specify how to perform a computation is potentially reduced. This increase in code density (roughly, the amount of work specified per instruction) allows more useful “work” to fit inside the cache, which results in fewer cache misses. Holding all other factors constant, reducing the number of instructions needed for a computation will decrease the runtime of the program (increase performance). Naturally all other factors are not exactly “constant”, but this can still help.

The vector organization of data is very helpful since it implies that there are no data dependencies between the elements that a single vector instruction operates on. This means that the hardware can be much simpler than the superscalar logic for checking resource availability, inter-instruction dependencies, pairing rules, etc., while still achieving a fair amount of parallelism.

We saw that the control dependencies due to loops can cause degradations in performance due to interruptions in fetching instructions and the need for branch prediction. Furthermore, in a superscalar processor, the parallelism is hampered by the fact that instructions can only be fetched into the core of the processor at a limited rate. A loop that is replaced by a few vector instructions basically allows every iteration of the loop to be fetched at the same time, thus exposing a large number of parallel operations that can be executed concurrently.

The downsides of vector processing are:

Figure 1: If there are enough functional units and register ports, vector instructions may be overlapped to further increase concurrency. (Diagram: a VADD over elements 0 through n−1 of V2 and V3 produces V1, while a VMUL over V1 and V4 produces V5.)

this creates is that 4 read ports from the vector registers are actually needed for this example. If, say, there are only 2 read ports, then the VMUL will not be able to proceed until after the last element of the preceding VADD instruction has run. This overlapping is very similar to data forwarding/bypassing, but in vector processing lingo, it is called chaining.

A vector load instruction may cause timing issues since a particular element of the vector may not reside in memory (page miss). Because of this, some implementations do not allow the overlap of vector computations with an earlier dependent vector load. That is, the vector load must completely read all values into the vector register before any further vector operations on that register may proceed.

There are a few other instructions that operate on vector data but generate scalar results. One example is vector compare or test, where all elements of a vector are pair-wise compared with another vector, and the result is a vector of bits that stores the result of each comparison. If there are 64 elements to a vector, then a single 64-bit scalar value can store the result of the vector comparison. Another common vector → scalar computation is the accumulate operation, where the sum of all of the elements is stored in a single scalar register, or the sum of the pair-wise products of two vectors is stored in a scalar register (the dot product).

There is an additional issue of what to do with vectors that are not long enough to use up all of the elements in the vector register. One simple approach is to pad the remaining elements with zeros, and the result of any operation is simply as defined. If two vectors of different lengths are added, the “extra” items of the longer vector remain the same. If the two vectors are multiplied, though, the extra items become zero. Another approach is to fill the extra elements with some invalid value, such as the IEEE NaN (Not a Number). The result of any operation that has a NaN as an argument is also a NaN. In this manner, the result of either addition or multiplication (in our example) is that the extra items in the result vector will always be NaN.
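The vector → scalar operations and the two padding strategies can be sketched as follows. This is an illustration only: the helper names (`vcompare_eq`, `vdot`, `pad`), the choice of an equality test for the comparison, the convention that bit i of the mask corresponds to element i, and the 4-element register length `VLEN` are all assumptions made to keep the example short.

```python
import math

VLEN = 4  # assumed register length for this demo; the notes use 64 as an example


def vcompare_eq(a, b):
    """Pair-wise compare; pack the results into one integer bit mask."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            mask |= 1 << i   # bit i records the result for element i
    return mask


def vdot(a, b):
    """Accumulate the pair-wise products into a single scalar (dot product)."""
    return sum(x * y for x, y in zip(a, b))


def pad(v, fill, length=VLEN):
    """Extend a short vector to the register length with a fill value."""
    return list(v) + [fill] * (length - len(v))


long_v = [1.0, 2.0, 3.0, 4.0]
short_v = [10.0, 20.0]

# Zero padding: extra items survive an add but become zero in a multiply.
zero_padded = pad(short_v, 0.0)
sum_zero = [x + y for x, y in zip(long_v, zero_padded)]
prod_zero = [x * y for x, y in zip(long_v, zero_padded)]

# NaN padding: extra items become NaN under both operations.
nan_padded = pad(short_v, math.nan)
sum_nan = [x + y for x, y in zip(long_v, nan_padded)]
prod_nan = [x * y for x, y in zip(long_v, nan_padded)]
```

With NaN padding, a consumer of the result can tell the valid elements from the filler, which is not possible with zero padding after an addition.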

2.2 Example Vector Processor: Cray-1

2.2.1 Physical Construction

A great quote from an old article from the Communications of the ACM about the Cray-1:

The CRAY-1 has been called “the world’s most expensive love-seat”. Certainly, most people’s first reaction to the CRAY-1 is that it is so small.

Now take a look at the physical organization of the machine shown in Figure 2. The quote really puts things into perspective. These days, we are used to having machines that fit on or under our desks, and even a tower case for a server may seem quite “large” to us. On the other hand, the CRAY-1 is about six and a half feet tall and over 8 feet in diameter! The system is organized as 12 wedge-shaped columns, arranged in a circular fashion spanning 270°. This was done to minimize the distances of the wires going between the different modules. The last quadrant was left open so that a person could get access to the modules inside the computer, either to connect devices or for repair/diagnostic reasons. In the words of the article:

This leaves room for a reasonably trim individual to gain access to the interior of the machine.

The interesting “vital stats” of the CRAY-1 are as follows:

Figure 2: The physical dimensions of the CRAY-1. (Dimensions labeled in the drawing: 19”, 77”, 56.5”, and 103.5”.)

  • 1662 modules, 113 module types
  • Up to 288 IC packages per module
  • For the configuration with maximum allowable memory, power consumption is approximately 115 kW (equivalent to over 380 modern computers at about 300 watts each)
  • Freon cooled with Freon/water heat exchange
  • 10,500 lbs. (w/ maximum memory)
  • Composed of only three basic chip types:
      - 5- and 4-input NAND gates
      - Memory chips
      - Register chips
      - (That’s all!)

The basic architecture includes 16 memory banks (interleaved/banked memory), 12 functional units, and more than 4KB worth of register storage (both scalar and vector). The other key implementation features of the CRAY-1 are the use of only four chip (IC) types, the speed of the main memory, the cooling system, and of course the computational core.
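As a quick sanity check on the “more than 4KB” figure, the register storage can be added up from the B, S, T, and V register files that these notes list for the CRAY-1. The sketch below uses only those four; any register files not listed in these notes are left out, so the total is a lower bound. The dictionary and function names are made up for the example.

```python
# (number of registers, bits per register), taken from the register list
# given in these notes. Other register files are intentionally omitted.
REGISTER_FILES = {
    "B (address-save)": (64, 24),
    "S (scalar)": (8, 64),
    "T (scalar-save)": (64, 64),
    "V (vector, 64 words of 64 bits)": (8, 64 * 64),
}


def total_register_bits(files):
    """Sum count * width over all register files."""
    return sum(count * bits for count, bits in files.values())


total_bits = total_register_bits(REGISTER_FILES)
total_bytes = total_bits // 8

# The vector registers alone account for exactly 4 KB.
vector_count, vector_bits = REGISTER_FILES["V (vector, 64 words of 64 bits)"]
vector_bytes = vector_count * vector_bits // 8
```

The eight vector registers contribute 4096 bytes by themselves, so adding the scalar and save registers is what pushes the total past 4KB.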

2.2.2 Four ICs

All logic in the CRAY is implemented entirely with a mix of high- and low-speed 4- and 5-input NAND gates. Registers are implemented with 16 × 4 bit chips (with 6ns access time), and memory is implemented with 1024 × 1 bit memory chips (with 50ns access time). The overall clock cycle time is 12.5ns (80MHz), which is incredibly fast for the late 1970s (consider that in 1990, an i386 at 16MHz was pretty fast)! Each IC comes in a 16-pin flat package. The small size of the ICs was important since a single module (small PCB) may contain up to 288 ICs, and up to 72 such modules can be inserted into a 28” high chassis (there are 2 chassis

  • Sixty-four 24-bit address-save registers (B)
  • Eight 64-bit scalar registers (S)
  • Sixty-four 64-bit scalar-save registers (T)
  • Eight 64-word vector registers (V), 4096 bits per vector

Unlike what was previously described, the vector registers in the CRAY-1 are not explicitly “floating-point”. They can hold any 64-bit value, regardless of whether the hardware interprets the values as 64-bit integers or floating point format numbers.

The address-save and scalar-save registers act somewhat like a cache for the computer (there is no other cache hierarchy). The difference is that these locations are made explicit to the programmer. This would be analogous to a modern processor with instructions that can load a value from memory to a particular cache location, and other instructions to load from particular cache locations into a register (and the other direction for stores). It only takes a single clock cycle to move a value from a register to a save-register (and the same for the other direction), but the normal operational instructions (arithmetic, etc.) cannot operate on the save-registers. This complicates the programming task somewhat since the allocation and management of data in the save registers is completely up to the programmer and/or compiler. Contrast this with a more traditional cache organization where everything is transparent to the programmer.

There are two additional registers to support vector operations. The register VL holds the length of the vectors being manipulated. For vectors that are shorter than 64 elements, cycles are not wasted processing the unused elements at the ends of the vectors. For vectors longer than 64 elements, the data are split across multiple vector registers. The register VM is a 64-bit vector mask register. The result of a 64-element vector comparison, for example, is placed in the VM register. The VM register interacts with the vector control logic for some operations to specify which elements of the vectors to process. For example, if the contents of the VM register are alternating 1’s and 0’s, then only every other vector element is processed.
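The roles of VL and VM can be illustrated with a short sketch. Nothing here is CRAY-1 code: `strip_mine` and `masked_vadd` are hypothetical helpers, and the convention that bit i of the mask controls element i is an assumption chosen for readability.

```python
MAX_VL = 64  # elements per vector register, as stated in the notes


def strip_mine(n):
    """Split an n-element vector into (start, vl) pieces of at most MAX_VL.

    Each piece corresponds to one setting of the VL register.
    """
    start = 0
    while start < n:
        vl = min(MAX_VL, n - start)
        yield start, vl
        start += vl


def masked_vadd(dest, a, b, vl, vm):
    """Add a and b into dest for the first vl elements whose mask bit is set.

    Elements whose mask bit is clear, or that lie beyond vl, are left untouched.
    """
    for i in range(vl):
        if (vm >> i) & 1:
            dest[i] = a[i] + b[i]
    return dest


# A 150-element vector needs three passes: 64 + 64 + 22 elements.
pieces = list(strip_mine(150))

# Alternating mask bits: only every other element is processed.
result = masked_vadd([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], vl=4, vm=0b0101)
```

Setting `vl` to the true length of the last piece is what avoids spending cycles on the unused tail of the register.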
There is no virtual memory in the system in the way we are used to, but a different technique is used to prevent different processes from accessing each other’s memory. Two registers, BA and LA, are used to define the memory boundaries for the currently running process. BA is the base address, and LA is the limit address. Any time a memory reference is made, the address in question is checked to see that it is greater than or equal to BA and less than LA (is the address in the allowed range?). If the address does not fall in this interval, then a memory access violation is flagged (interrupt), usually causing the process to terminate. The operating system must know exactly how much memory each process is going to use, and then it can set these registers accordingly.
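The base/limit check described above amounts to a single range test. A minimal sketch, assuming hypothetical names (`is_allowed`, `check_access`, `MemoryAccessViolation`) and using a Python exception to stand in for the interrupt:

```python
class MemoryAccessViolation(Exception):
    """Stands in for the interrupt raised on an out-of-range reference."""


def is_allowed(address, ba, la):
    """True when BA <= address < LA, the interval described in the notes."""
    return ba <= address < la


def check_access(address, ba, la):
    """Return the address if it is in range, otherwise flag a violation."""
    if not is_allowed(address, ba, la):
        raise MemoryAccessViolation(
            f"address {address} outside [{ba}, {la})"
        )
    return address


# Example process window: base address 100, limit address 200.
BA, LA = 100, 200
first_ok = check_access(100, BA, LA)   # BA itself is inside the window
last_ok = check_access(199, BA, LA)    # LA - 1 is the last legal address
```

Note the asymmetry: BA is a legal address but LA is not, since the check is “greater than or equal to BA, and less than LA”.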

3 Superscalar SIMD Extensions

Mainstream processing (IE, Word, etc.) generally doesn’t exhibit the levels of parallelism that scientific computing often has. That is part of the reason why vector computers remained only in the supercomputing market and were not integrated into commodity processors. Additionally, in the past it has been too expensive (either in terms of components or silicon area) to support vector operations in general purpose CPUs.

More recently (over the past decade), relatively powerful processors have opened up new application areas to the mass market, such as audio and video manipulation (listening to MP3s, watching videos), high end graphics (Photoshop, video games), and other areas that have higher levels of parallelism. For example, for many image processing tasks, the individual output pixels can be computed in parallel. These applications have many small, tight inner loops in their code, many of which can be transformed into more vector-like operations.

Intel introduced an extension to their x86 ISA called MMX (originally “Matrix Manipulation eXtensions”, and then later rechristened as “Multi-Media eXtensions” by the marketing folks). The MMX extensions define some new registers for holding vector data, as well as new vector instructions for manipulating the vector data. In the original MMX definition, eight new registers were added to the x86 architecture: MM0-MM7. Each register is 64 bits wide, and can hold either a single 64-bit value, two 32-bit values, four 16-bit values, or eight 8-bit values. So the scope of vectorization is somewhat less than what was used in the CRAY-1.

The new instructions operate on the MM* registers. Moving data into and out of the MM* registers requires MOVD and MOVQ instructions (for move double- and move quad-word). The values may be moved from memory to MM*

Figure 3: Block diagram of the CRAY-1 computer. (Labels in the diagram: instruction buffers with NIP, CIP, LIP, and the PC; address registers A with save registers B; scalar registers S with save registers T; vector registers V with VL, VM, and RTC; vector units (add, logical, shift); FP units (add, multiply, reciprocal approximation); scalar units (add, logic, shift, pop count); address units (add, multiply); and 16-way interleaved memory.)
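Returning to the MMX registers of Section 3: the idea of one 64-bit register holding eight 8-bit, four 16-bit, or two 32-bit values can be modelled by treating the register as an integer and adding it lane by lane. This is a sketch only; `packed_add` is a hypothetical helper, not an MMX instruction, and it models only wraparound arithmetic, where a carry out of one lane is discarded rather than spilling into its neighbor.

```python
def packed_add(a, b, lane_bits):
    """Add two 64-bit values lane by lane, wrapping each lane independently."""
    if lane_bits not in (8, 16, 32, 64):
        raise ValueError("lane width must be 8, 16, 32, or 64 bits")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        # Extract the same lane from both operands and add them.
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Drop any carry out of the lane before putting it back in place.
        result |= (lane_sum & lane_mask) << shift
    return result


x = 0x00FF_0001_0002_0003
y = 0x0001_0001_0001_0001

as_words = packed_add(x, y, 16)   # four 16-bit lanes
as_bytes = packed_add(x, y, 8)    # eight 8-bit lanes
```

The same 64 bits give different answers depending on the lane width: as 16-bit lanes `0x00FF + 0x0001` is `0x0100`, while as 8-bit lanes the `0xFF + 0x01` byte wraps to `0x00` and the carry is lost.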