





An overview of computer taxonomies with a focus on vector processing, in which a single instruction operates on many data elements at once. The notes cover the advantages and disadvantages of vector processing, including improved code density, a reduced number of instructions, and the handling of data dependencies, as well as vector registers, functional units, and the organization of data in vector processors.






There are many possible ways to describe computer systems, and many different taxonomies or categorization systems have been proposed. The most common, due to Flynn, is called Flynn’s Taxonomy. This system divides machines up depending on (1) how many instruction streams there are and (2) how many data streams there are. By instruction streams, we mean separate and independent paths of control and state; that is, each instruction stream has its own program counter and its own registers. By data streams, we mean independent pieces of data. A vector of multiple pieces of data may comprise multiple data streams, and independent memory values that are handled concurrently may also comprise separate data streams. Each of the two attributes, Instructions and Data, may be classified as Single or Multiple. This yields four possible combinations:
                       Data Streams
Instruction Streams    Single    Multiple
Single                 SISD      SIMD
Multiple               MISD      MIMD
Note that these are streams that are exposed to the programmer. Someone writing a program for a MIMD machine must explicitly write more than one thread of execution (instruction stream) and make sure that the different data handled in parallel are handled correctly. In a superscalar processor, multiple instructions may execute concurrently on distinct sets of data, but all of this occurs “beneath the covers” and is something that the programmer cannot see.
Why do we bother with taxonomies? Categorizing systems helps in understanding many of their pertinent issues. Not only in computer systems but in general, taxonomies help in evaluating different attributes in an organized manner, which may shed additional light on tradeoffs or on the relative importance of different parameters. Flynn’s taxonomy is perhaps one of the best known, and therefore most commonly used, probably because of its simplicity. Unfortunately, it is often a little too coarse, and so refinements and extensions to Flynn’s taxonomy have been introduced over time. This lecture focuses on SIMD systems, starting with one of the grandfathers, the Cray-1 supercomputer, and then looking at MMX/3DNow!/SSE, which are more modern reincarnations of the SIMD technology that was used back in the 70’s for the Cray.
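As a small illustration of the two-attribute classification above, the following sketch maps counts of programmer-visible instruction and data streams onto the four categories. The function name `flynn_class` and the use of integer stream counts are assumptions made for this example, not part of Flynn's original formulation.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for programmer-visible stream counts.

    Hypothetical helper: 1 means "Single", anything larger means "Multiple".
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


# One program counter operating on a 64-element vector is SIMD,
# while four independent threads on four data sets are MIMD.
print(flynn_class(1, 64))
print(flynn_class(4, 4))
```

A superscalar uniprocessor would still be classified with one instruction stream here, since its internal parallelism is not visible to the programmer.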
The pipelined and superscalar processors that we have been covering in this course are all uniprocessors, which are processors that have only a single thread of execution, or simply a single program counter. Vector processors are also uniprocessors, and are in many ways similar to more traditional SISD machines. There is a single program counter, and there are typically registers, functional units and so on. The difference is that in a vector machine, there are typically additional vector registers, vector functional units, and vector instructions to manipulate the vector data.
A vector register may be composed of a large number of smaller scalar registers. Each scalar register holds an independent number, traditionally interpreted as a floating point value (you can interpret it however you want since it’s all just a bunch of bits). Vector computations are more common in scientific computing, where you may be modelling something or performing some mathematical computation (over continuous variables and functions), which is why floating point is typically used. High throughput (pipelined) vector functional units are used to process data from the vector registers at a high rate. To command the computer to perform these operations, special vector instructions must be used. A typical vector operation would be: $\vec{A} = \vec{B} + \vec{C}$
In a non-vector machine, such code would have to be compiled into a loop that iteratively adds the components of the vectors together in a point-wise fashion. In a vector machine, this is accomplished with a single instruction:
VOP Vdest Varg1 Varg2
Where VOP specifies a vector operation (such as add or multiply), and the V∗ arguments name vector registers. The advantages of vector instructions are:
Since a single (or a few) instructions can replace the entire body and control overhead of a loop, the number of instructions needed to specify how to perform a computation is potentially reduced. This increase in code density (roughly the number of instructions to perform some task or work) allows more useful “work” to fit inside the cache, which results in fewer cache misses.
Holding all other factors constant, reducing the number of instructions needed for a computation will decrease the runtime of the program (increase performance). Naturally all other factors are not exactly “constant”, but this can still help.
The vector organization of data is very helpful since it implies that there are no data dependencies when instructions operate on the data. This means that the hardware can be much simpler than the superscalar logic for checking resource availability, inter-instruction dependencies, pairing rules, etc., while achieving a fair amount of parallelism.
We saw that the control dependencies due to loops can cause degradations in performance due to interruptions in fetching instructions and the need for branch prediction. Furthermore, in a superscalar processor, the parallelism is hampered by the fact that the instructions can only be fetched into the core of the processor at a limited rate. A loop that is replaced by a few vector instructions basically allows every iteration of the loop to be fetched at the same time, thus exposing a large number of parallel operations that can be concurrently executed.
The downsides of vector processing are:
Figure 1: If there are enough functional units and register ports, vector instructions may be overlapped to further increase concurrency.
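The overlap described in the caption of Figure 1 can be sketched with a simple cycle-count model. This is an illustrative sketch under stated assumptions, not a description of any particular machine: each functional unit is assumed to be fully pipelined and to accept one element per cycle, the latencies passed in are hypothetical values, and the register names in the comments (including the destination V5) are made up for the example.

```python
def vadd(a, b):
    """Element-wise add: the effect of a single vector add instruction."""
    return [x + y for x, y in zip(a, b)]


def vmul(a, b):
    """Element-wise multiply: the effect of a single vector multiply."""
    return [x * y for x, y in zip(a, b)]


def cycles_sequential(n, add_latency, mul_latency):
    """VMUL may not start until the last VADD result has been produced."""
    add_done = add_latency + (n - 1)   # last of n elements leaves the adder
    mul_done = mul_latency + (n - 1)   # then the multiplier runs all n elements
    return add_done + mul_done


def cycles_overlapped(n, add_latency, mul_latency):
    """Each VADD result is passed to the multiplier as soon as it is ready."""
    # Element i leaves the adder after i + add_latency cycles and the
    # multiplier mul_latency cycles later; the last element has i = n - 1.
    return (n - 1) + add_latency + mul_latency


# Hypothetical registers: V1 = V2 + V3, then V5 = V4 * V1.
v2, v3, v4 = [1, 2], [3, 4], [10, 10]
v1 = vadd(v2, v3)
v5 = vmul(v4, v1)

# Illustrative numbers: 64-element vectors, 6-cycle add, 7-cycle multiply.
print(cycles_sequential(64, 6, 7), cycles_overlapped(64, 6, 7))
```

Under this model the benefit of overlapping grows with vector length, because the second instruction's pass over the n elements is hidden behind the first instead of being added to it.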
this creates is that 4 read ports from the vector registers are actually needed for this example. If, say, there are only 2 read ports, then the VMUL will not be able to proceed until after the last element of the preceding VADD instruction has been processed. This is very similar to data forwarding/bypassing, but in vector processing lingo, it is called chaining.
A vector load instruction may cause timing issues since a particular element of the vector may not reside in memory (a page fault). Because of this, some implementations do not allow the overlap of vector computations with an earlier dependent vector load. That is, the vector load must completely read all values into the vector register before any further vector operations on that register may proceed.
There are a few other instructions that operate on vector data but generate scalar results. One example is vector compare or test, where all elements of a vector are compared pair-wise with those of another vector, and the result is a vector of bits that store the results of each comparison. If there are 64 elements in a vector, then a single 64-bit scalar value can store the result of the vector comparison. Another common vector → scalar computation is the accumulate operation, where the sum of all of the elements is stored in a single scalar register, or the sum of the pair-wise products of two vectors is stored in a scalar register (the dot product).
There is an additional issue of what to do with vectors that are not long enough to use up all of the elements in the vector register. One simple approach is to pad the remaining elements with zeros, so that the result of any operation is simply whatever the operation defines. If two vectors of different lengths are added, the “extra” items of the longer vector remain the same. If the two vectors are multiplied, though, the extra items become zero. Another approach is to fill the extra elements with some invalid value, such as the IEEE NaN (Not a Number). The result of an arithmetic operation such as addition or multiplication that has a NaN as an argument is also a NaN. In this manner, the result of either addition or multiplication (in our example) is that the extra items in the result vector will always be NaN.
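The vector-to-scalar operations and the two padding strategies described above can be sketched as follows. This is a minimal model using Python lists as stand-ins for vector registers; the register length `VLEN`, the helper names, and the choice of a less-than comparison are assumptions for illustration only.

```python
import math

VLEN = 4  # hypothetical vector register length (the text mentions 64)


def pad(values, fill, length=VLEN):
    """Extend a short vector to the register length with a fill value."""
    if len(values) > length:
        raise ValueError("vector longer than the register")
    return list(values) + [fill] * (length - len(values))


def vadd(a, b):
    return [x + y for x, y in zip(a, b)]


def vmul(a, b):
    return [x * y for x, y in zip(a, b)]


def compare_mask(a, b):
    """Pack the results of a[i] < b[i] into one integer, bit i per element."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x < y:
            mask |= 1 << i
    return mask


def dot(a, b):
    """Accumulate the pair-wise products into a single scalar."""
    return sum(x * y for x, y in zip(a, b))


short, longer = [1.0, 2.0], [10.0, 20.0, 30.0]

# Zero padding: addition keeps the extra item of the longer vector,
# multiplication turns it into zero.
zero_sum = vadd(pad(short, 0.0), pad(longer, 0.0))
zero_prod = vmul(pad(short, 0.0), pad(longer, 0.0))

# NaN padding: every extra position in the result becomes NaN.
nan_sum = vadd(pad(short, math.nan), pad(longer, math.nan))

print(zero_sum, zero_prod, nan_sum)
```

With NaN padding, an accumulate such as `dot` over the full register would itself return NaN, so a real design would need some way to restrict the operation to the valid elements; the sketch does not model that.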
2.2.1 Physical Construction
A great quote from an old article from the Communications of the ACM about the Cray-1:
The CRAY-1 has been called “the world’s most expensive love-seat”. Certainly, most people’s first reaction to the CRAY-1 is that it is so small.
Now take a look at the physical organization of the machine shown in Figure 2. The quote really puts things into perspective. These days, we are used to having machines that fit on or under our desks, and even a tower case for a server may seem quite “large” to us. On the other hand, the CRAY-1 is about 6 and a half feet tall and over 8 feet in diameter! The system is organized as 12 wedge-shaped columns, arranged in a circular fashion encompassing 270°. This was done to minimize the lengths of the wires running between the different modules. The last quadrant was left open so that a person could get access to the modules inside the computer, either to connect devices or for repair and diagnostics. In the words of the article:
This leaves room for a reasonably trim individual to gain access to the interior of the machine.
The interesting “vital stats” of the CRAY-1 are as follows:
Figure 2: The physical dimensions of the CRAY-1 (dimensions labeled in the figure: 19”, 77”, 56.5”, and 103.5”).
The basic architecture includes 16 memory banks (interleaved/banked memory), 12 functional units, and more than 4KB of register storage (both scalar and vector). The other key implementation features of the CRAY-1 are its use of only four chip (IC) types, the speed of its main memory, its cooling system, and, of course, its computational core.
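To make the idea of interleaved (banked) memory concrete, here is a minimal sketch. The text only states that there are 16 banks; the mapping used below, in which consecutive word addresses rotate through the banks (low-order interleaving), and the helper names are assumptions for illustration.

```python
NUM_BANKS = 16  # from the text: 16 memory banks


def bank_of(word_address: int) -> int:
    """Assumed low-order interleaving: consecutive words use consecutive banks."""
    return word_address % NUM_BANKS


def banks_touched(start: int, stride: int, count: int) -> set:
    """Set of banks hit when reading `count` words spaced `stride` apart."""
    return {bank_of(start + i * stride) for i in range(count)}


# A unit-stride vector load spreads its accesses over every bank, so the
# banks can work in parallel. A stride equal to the number of banks keeps
# hitting the same bank, so each access must wait for the previous one.
unit_stride = banks_touched(start=0, stride=1, count=16)
bank_stride = banks_touched(start=0, stride=16, count=16)
print(len(unit_stride), len(bank_stride))
```

Under this assumed mapping, the access pattern of a vector load matters as much as the number of banks: interleaving only helps when successive elements fall in different banks.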
2.2.2 Four ICs
All logic in the CRAY-1 is implemented with a mix of high- and low-speed 4- and 5-input NAND gates. Registers are implemented with 16 × 4 bit chips (with 6ns access time), and memory is implemented with 1024 × 1 bit memory chips (with 50ns access time). The overall clock cycle time is 12.5ns (80MHz), which is incredibly fast for the late 70’s (consider that in 1990, an i386 at 16MHz was pretty fast)! Each IC comes in a 16-pin flat package. The small size of the ICs was important since a single module (small PCB) may contain up to 288 ICs, and up to 72 such modules can be inserted into a 28” high chassis (there are 2 chassis
Mainstream processing (IE, Word, etc.) generally doesn’t exhibit the levels of parallelism that scientific computing often has. That is part of the reason why vector computers remained only in the supercomputing market, and were not integrated into commodity processors. Additionally, in the past it was too expensive (either in terms of components or silicon area) to support vector operations in general purpose CPUs. More recently (over the past decade), relatively powerful processors have opened up new application areas to the mass market, such as audio and video manipulation (listening to MP3s, watching videos), high end graphics (Photoshop, video games), and other areas that have higher levels of parallelism. For example, for many image processing tasks, the individual output pixels can be computed in parallel. These applications have many small, tight inner loops in their code, many of which can be transformed into more vector-like operations.
Intel introduced an extension to their x86 ISA called MMX (variously expanded as “Matrix Math eXtensions” or “Multi-Media eXtensions”, although Intel officially treats the name as a trademark rather than an acronym). The MMX extensions define some new registers for holding vector data, as well as new vector instructions for manipulating the vector data. In the original MMX definition, eight new registers were added to the x86 architecture: MM0-MM7. Each register is 64 bits wide, and can hold either a single 64-bit value, two 32-bit values, four 16-bit values, or eight 8-bit values. So the scope of vectorization is somewhat less than what was used in the CRAY-1. The new instructions operate on the MM* registers. Moving data into and out of the MM* registers requires MOVD and MOVQ instructions (for move double- and move quad-word). The values may be moved from memory to MM*