Advanced Computer Architecture (B.C.A. Part-III) Nitika Newar, MCA Deptt. of I.T. Biyani Girls College, Jaipur
Syllabus B.C.A. Part-III Advanced Computer Architecture
Parallel Computer Models : The state of computing, multiprocessors and multicomputers, multivector and SIMD computers, architectural development tracks.
Program and Network Properties : Conditions of parallelism, program partitioning and scheduling, program flow mechanisms.
System Interconnect Architectures : Network properties and routing, static interconnection networks and dynamic interconnection networks.
Processors and Memory Hierarchy : Advanced processor technology (CISC, RISC, superscalar, vector, VLIW and symbolic processors), memory technology.
Bus, Cache and Shared Memory. Linear Pipeline Processors, Nonlinear Pipeline Processors, Instruction Pipeline Design, Multiprocessor System Interconnects, Vector Processing Principles, Multivector Multiprocessors.
Content
S. No. Name of Topic
1. Parallel Computer Models
1.1 Multiprocessors
1.2 Parallel processing
1.3 State of computing
1.4 History of computer architecture
1.5 Parallelism
1.6 Levels of parallelism
1.7 Vector supercomputers
1.8 Shared memory multiprocessor
1.9 Distributed memory multicomputers
1.10 SIMD computers
1.11 Architectural development tracks
1.12 SIMD array processor
2. Program partitioning or scheduling
2.1 Program flow mechanisms
2.2 Data flow architecture
2.3 Grain sizes & latency
2.4 Scheduling procedure
3. System Interconnect Architecture
3.1 Network properties
3.2 Bisection width
3.3 Data routing functions

Chapter 1 Parallel Computer Models

Q.1. What are multiprocessors? What are the types of multiprocessors?
Ans. A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment. Multiprocessors are classified as multiple instruction stream, multiple data stream (MIMD) systems.
There are some similarities between multiprocessor and multicomputer systems, since both support concurrent operations. However, there is an important distinction between a system with multiple computers and a system with multiple processors. In a multicomputer system, computers are interconnected with each other by communication lines to form a computer network. The network consists of several autonomous computers that may or may not communicate with each other. A multiprocessor system, by contrast, is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
Very-large-scale integrated (VLSI) circuit technology has reduced the cost of computer components to such a low level that applying multiple processors to meet system performance requirements has become an attractive design possibility.
Multiprocessing improves the reliability of the system, so that a failure or error in one part has a limited effect on the rest of the system. If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor. The system as a whole can continue to function correctly, with perhaps some loss in efficiency.
The benefit derived from a multiprocessor organisation is improved system performance. The system derives its high performance from the fact that computations can proceed in parallel in one of two ways:
1. Multiple independent jobs can be made to operate in parallel.
2. A single job can be partitioned into multiple parallel tasks.
An example is a computer system where one processor performs the computations for an industrial process control while others monitor and control various parameters such as temperature and flow rate. Another example is a computer where one processor performs high-speed floating-point mathematical computations and another takes care of routine data processing tasks.
Multiprocessing can improve performance by decomposing a program into parallel executable tasks. This can be achieved in one of two ways:
1. The user can explicitly declare that certain tasks of the program be executed in parallel. This must be done prior to loading the program by specifying the parallel executable segments. Most multiprocessor manufacturers provide an operating system with programming language constructs suitable for specifying parallel processing.
2. The other, more efficient way is to provide a compiler with multiprocessor software that can automatically detect parallelism in a user's program. The compiler checks for data dependency in the program. If a part of a program depends on data generated in another part, the part yielding the needed data must be executed first. However, two parts of a program that do not use data generated by each other can run concurrently. The parallelizing compiler checks the entire program to detect any possible data dependence. Those parts that have no data dependency are then considered for concurrent scheduling on different processors.
Multiprocessors are classified by the way their memory is organized. A multiprocessor system with common shared memory is classified as a shared memory or tightly coupled multiprocessor. This does not preclude each processor from having its own local memory. In fact, most commercial tightly coupled multiprocessors provide a cache memory with each CPU. In addition, there is a global common memory that all CPUs can access. Information can therefore be shared among the CPUs by placing it in the common global memory.
An alternative model of multiprocessor is the distributed memory or loosely coupled system. Each processor element in a loosely coupled system has its own private local memory. The processors are tied together by a switching scheme designed to route information from one processor to another through a message passing scheme. The processors relay programs and data to other processors in packets. A packet consists of an address, the data content and some error detection code.
The packets are addressed to a specific processor or taken by the first available processor, depending on the communication system used. Loosely coupled systems are most efficient when the interaction between tasks is minimal, whereas tightly coupled systems can tolerate a higher degree of interaction between tasks.

Q.2. What is parallel processing?
Ans. Parallel processing is a term used to denote a large class of techniques that provide simultaneous data processing tasks for the purpose of increasing the computational speed of a computer system. Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve faster execution time. The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput, that is, the amount of processing that can be accomplished during a given interval of time.
Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. Parallel processing is established by distributing the data among the multiple functional units. For example, the arithmetic, logic and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit.
There are a variety of ways in which parallel processing can be classified. One classification, introduced by M.J. Flynn, considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously. The normal operation of a computer is to fetch instructions from memory and execute them in the processor. The sequence of instructions read from memory constitutes an instruction stream. The operations performed on the data in the processor constitute a data stream. Parallel processing may occur in the instruction stream, in the data stream, or in both. Flynn's classification divides computers into four groups:
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)
[Figures: SISD, SIMD, MISD and MIMD organizations. CU: control unit, PU: processing unit, MM: memory module, IS: instruction stream, DS: data stream]

Q.3. Explain the state of computing?
Ans. Modern computers are equipped with powerful hardware facilities driven by extensive software packages. To assess the state of computing we first review historical milestones in the development of computers.
Computer Generations
Over the past five decades, electronic computers have gone through five generations of development. Each of the first three generations lasted about 10 years. The fourth generation covered a time span of 15 years. We have just entered the fifth generation with the use of processors and memory devices with more than 1 million transistors on a single silicon chip. The table indicates the new hardware and software features introduced with each generation. Most features introduced in earlier generations have been passed to later generations. In other words, the latest generation computers have inherited the good features and eliminated the bad ones found in previous generations.

Five Generations of Electronic Computers
First generation (1945-54)
  Technology & architecture: Vacuum tubes and relay memories, CPU driven by PC and accumulator, fixed-point arithmetic.
  Software & applications: Machine/assembly languages, single user, no subroutine linkage, programmed I/O using CPU.
  Representative systems: ENIAC, Princeton IAS, IBM 701.
Second generation (1955-64)
  Technology & architecture: Discrete transistors and core memories, floating-point arithmetic, I/O processors, multiplexed memory access.
  Software & applications: HLL used with compilers, subroutine libraries, batch processing monitor.
  Representative systems: IBM 7090, CDC 1604, Univac LARC.
Third generation (1965-74)
  Technology & architecture: Integrated circuits (SSI/MSI), microprogramming, pipelining, cache and lookahead processors.
  Software & applications: Multiprogramming and time-sharing OS, multiuser applications.
  Representative systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8.
Fourth generation (1975-90)
  Technology & architecture: LSI/VLSI and semiconductor memory, multiprocessors, vector supercomputers, multicomputers.
  Software & applications: Multiprocessor OS, languages, compilers and environments for parallel processing.
  Representative systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC2000.
Fifth generation (1991-present)
  Technology & architecture: ULSI/VHSIC processors, memory and switches, high-density packaging, scalable architectures.
  Software & applications: Massively parallel processing, grand challenge applications, heterogeneous processing.
  Representative systems: Fujitsu VPP500, Cray/MPP, TMC/CM-5, Intel Paragon.

Q.4. How is computer architecture developed?
Ans. Over the past four decades, computer architecture has gone through evolutionary rather than revolutionary changes. Sustaining features are those that have proven to deliver performance. According to the figure, we started with the von Neumann architecture, built as a sequential machine executing scalar data. Sequential computers improved from bit-serial to word-parallel operations and from fixed-point to floating-point operations. The von Neumann architecture is slow due to the sequential execution of instructions in a program.
Lookahead, Parallelism and Pipelining : Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction fetch/decode and execution) operations and to enable functional parallelism. Functional parallelism was supported by two approaches: one is to use multiple functional units simultaneously, and the other is to practice pipelining at various processing levels. The latter includes pipelined instruction execution, pipelined arithmetic computations and memory access operations. Pipelining has proven especially attractive in performing identical operations repeatedly over vector data strings. Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.
Flynn's Classification : Michael Flynn (1972) introduced a classification of various computer architectures based on the notions of instruction and data streams. Conventional sequential machines are SISD (single instruction stream over a single data stream). Vector computers are equipped with scalar and vector hardware or appear as SIMD (single instruction stream over multiple data streams) machines. Parallel computers are reserved for MIMD (multiple instruction streams over multiple data streams) machines.

Q.5. What is Parallelism?
What are the various conditions of parallelism?
Ans. Parallelism is a major concept in today's computers. The use of multiple functional units is a form of parallelism within the CPU. Early computers had only one arithmetic and functional unit, so only one operation could execute at a time. The ALU function can instead be distributed to multiple functional units operating in parallel.
H.T. Kung has identified the need to move forward in three areas: computation models for parallel computing, interprocess communication in parallel architectures, and system integration for incorporating parallel systems into the general computing environment.
Conditions of Parallelism :
1. Data and resource dependencies : A program consists of several segments, so the ability to execute several program segments in parallel requires that each segment be independent of the other segments. Dependencies among the segments of a program may take various forms, such as resource dependence, control dependence and data dependence. A dependence graph is used to describe the relations. Program statements are represented by nodes, and directed edges with different labels show the ordered relations among the statements. Analysis of the dependence graph shows where opportunities exist for parallelization and vectorization.
Data Dependencies: The ordering relation between statements is indicated by data dependence. There are 5 types of data dependence:
(a) Antidependence: A statement S2 is antidependent on statement S1 if S2 follows S1 in program order and the output of S2 overlaps the input to S1.
(b) I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
(c) Unknown dependence: The dependence relation between two statements cannot be determined in the following situations:
  The subscript of a variable is itself subscripted.
  The subscript does not contain the loop index variable.
  The subscript is nonlinear in the loop index variable.
(d) Output dependence: Two statements are output dependent if they produce the same output variable.
(e) Flow dependence: A statement S2 is flow dependent on statement S1 if an execution path exists from S1 to S2 and at least one output of S1 feeds in as an input to S2.
2. Bernstein's conditions : Bernstein revealed a set of conditions under which two processes can execute in parallel. A process is a program in execution; it is an active entity. It is an abstraction of a program fragment defined at various processing levels.
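The flow, anti- and output dependences listed under Data Dependencies can be sketched by comparing the sets of variables that each statement reads and writes. The following is a minimal illustration, not a compiler algorithm: the function name, the dictionary layout and the four sample statements are assumptions made for this example, and it ignores execution paths, intervening writes, I/O dependence and unknown dependence.

```python
def classify_dependences(s1, s2):
    """Return the data dependences of s2 on s1, where s1 precedes s2.

    Each statement is a dict with 'reads' and 'writes' sets of variable
    names (a hypothetical representation chosen for this sketch).
    """
    found = []
    if s1["writes"] & s2["reads"]:
        found.append("flow")      # S1 produces a value that S2 consumes
    if s1["reads"] & s2["writes"]:
        found.append("anti")      # S2 overwrites a value that S1 consumes
    if s1["writes"] & s2["writes"]:
        found.append("output")    # both statements produce the same variable
    return found


# Hypothetical statements, in program order:
#   S1: a = b + c
#   S2: d = a * 2
#   S3: b = 7
#   S4: a = 0
S1 = {"reads": {"b", "c"}, "writes": {"a"}}
S2 = {"reads": {"a"}, "writes": {"d"}}
S3 = {"reads": set(), "writes": {"b"}}
S4 = {"reads": set(), "writes": {"a"}}

for name, later in (("S2", S2), ("S3", S3), ("S4", S4)):
    print(name, "depends on S1 through:", classify_dependences(S1, later))
```

When the function returns an empty list for a pair of statements, as it does for S2 and S3 here, neither statement reads or overwrites anything the other writes, which is the situation in which the two could be considered for concurrent execution.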
Ii is the input set of process Pi, the set of all input variables needed to execute the process. Similarly, the output set Oi consists of all output variables generated after execution of process Pi. Input variables are the operands fetched from memory or registers. Output variables are the results to be stored in working registers or memory locations.
Let there be two processes P1 and P2, with input sets I1 and I2 and output sets O1 and O2. The two processes can execute in parallel, denoted P1 || P2, if and only if they are independent and do not create confusing results. Formally, this requires I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅.
3. Software Parallelism : Software parallelism is defined by the control and data dependence of programs. The degree of parallelism is revealed in the program profile or in the program flow graph. Software parallelism is a function of algorithm, programming style and compiler optimization. The program flow graph displays the patterns of simultaneously executable operations. Parallelism in a program varies during the execution period.
4. Hardware Parallelism : Hardware parallelism is defined by the machine architecture and hardware multiplicity. It is a function of cost and performance tradeoffs. It displays the resource utilization patterns of simultaneously executable operations. It also indicates the peak performance of the processor resources. One way to characterize the parallelism in a processor is by the number of instructions issued per machine cycle.

Q.6. What are the different levels of parallelism?
Ans. Levels of parallelism are described below:
1. Instruction Level : At the instruction level, a grain consists of fewer than 20 instructions and is called fine grain. Depending on the individual program, fine-grain parallelism at this level may range from two to thousands. Single-instruction-stream parallelism is greater than two, but the average parallelism at the instruction level is around five, rarely exceeding seven, in an ordinary program. For scientific applications, the average parallelism is in the range of 500 to 3000 Fortran statements executing concurrently in an idealized environment.
2. Loop Level : This level covers iterative loop operations. A loop may contain fewer than 500 instructions. Loop operations that are independent across iterations can be vectorized for pipelined execution or for lock-step execution on SIMD machines. Loop-level parallelism is the most optimized program construct to execute on a parallel or vector computer, but recursive loops are difficult to parallelize. Vector processing is mostly exploited at the loop level by a vectorizing compiler.
3. Procedural Level : This level corresponds to medium grain size at the task, procedure and subroutine levels. A grain at this level contains fewer than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels. The communication requirement is much less than that required in MIMD execution mode. However, significant effort is required from the programmer to restructure a program at this level.
4. Subprogram Level : The subprogram level corresponds to job steps and related subprograms. The grain size here typically contains thousands of instructions. Job steps can overlap across different jobs. Multiprogramming on a uniprocessor or a multiprocessor is conducted at this level.
5. Job Level : This level corresponds to the parallel execution of independent jobs on a parallel computer. The grain size here can be tens of thousands of instructions. It is handled by the program loader and by the operating system. Time-sharing and space-sharing multiprocessors exploit this level of parallelism.

Q.7. Explain Vector super computers?
Ans. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it is executed directly by the scalar processor using the scalar functional pipelines. If the instruction is decoded as a vector operation, it is sent to the vector control unit. This control unit supervises the flow of vector data between the main memory and the vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a
vector processor.
Computers with vector processing capabilities are in demand in specialized applications. The following are representative application areas where vector processing is of utmost importance:
Long-range weather forecasting
Petroleum exploration
Medical diagnosis
Space flight simulations
Vector Processor Models
[Figure: The architecture of a vector supercomputer. Labels: host computer, mass storage, main memory, scalar processor (scalar control unit, scalar functional pipelines), vector processor (vector control unit, vector registers, vector functional pipelines)]

Q.8. What are the different shared memory multiprocessor models?
Ans. The most popular parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers: shared memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication. The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among nodes.
There are three shared memory multiprocessor models:
1. Uniform memory access (UMA) model
2. Non-uniform memory access (NUMA) model
3. Cache-only memory architecture (COMA) model
These models differ in how the memory and peripheral resources are shared or distributed.
1. UMA Model:
[Figure: The UMA multiprocessor model. Processors P1, P2, ..., Pn connect through a system interconnect (bus, crossbar or multistage network) to I/O and shared memory modules SM1, ..., SMm]
In this model the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared. Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The UMA model is suitable for time-sharing applications by multiple users. It can also be used to speed up the execution of a single large program in time-critical applications.
When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running executive programs, such as the kernel. In an asymmetric multiprocessor, only one processor or a subset of processors is executive capable. An executive or master processor can execute the operating system and handle I/O. The remaining processors, called attached processors (APs), execute user code under the supervision of the master processor.
2. NUMA model: A NUMA multiprocessor is a shared memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted. In the first, the shared memory is physically distributed to all processors as local memories. The collection of all local memories forms a global address space accessible by all processors. It is faster for a processor to access its own local memory. Access to remote memory attached to other processors takes longer due to the added delay through the interconnection network.
[Figure: Shared local memories. Local memories LM1, LM2, ..., LMn attached to processors P1, P2, ..., Pn through an interconnection network]
In the hierarchical cluster model, processors are divided into several clusters. Each cluster may itself be a UMA or a NUMA multiprocessor. Each cluster is connected to the global shared memory modules. All processors of a single cluster uniformly access the cluster shared memory modules. All clusters have equal access to the global memory. Access time to the cluster memory is shorter than that to the global memory.
[Figure: A hierarchical cluster model. Global shared memories (GSM) on a global interconnect network; each cluster has processors P1, P2, ..., Pn and cluster shared memories (CSM) joined by a cluster interconnection network (CIN)]
3. Cache Only Memory Architecture: This model is a special case of a NUMA machine in which the distributed main memories are replaced with cache memories. There is no memory hierarchy at an individual processor node. All the caches form a global address space. Depending on the interconnection network used, directories may be used to help locate copies of cache blocks. Examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine (DDM).
[Figure: The COMA model. Nodes, each with a processor (P), cache (C) and directory (D), joined by an interconnection network]

Q.9. What are Distributed Memory Multicomputers?
Ans. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory and sometimes attached disks or I/O peripherals. All local memories are private and accessible only by the local processors. The message-passing network provides point-to-point static connections among the nodes. Internode communication is carried out by passing messages through the static connection network.
[Figure: Message-passing interconnection network (mesh, ring etc.) linking nodes, each with a processor (P) and local memory (M)]
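The multicomputer organisation described in Q.9 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the class `Node`, the function `send_on_ring`, the one-directional ring topology and the dictionary used as a message are all hypothetical choices, not part of any real machine's interface. It shows the two points made in the answer: each node's local memory is private, and data moves between nodes only as messages over fixed point-to-point links.

```python
from collections import deque


class Node:
    """One autonomous node: a private local memory plus an inbox for messages."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_memory = {}   # private: only this node's code writes to it
        self.inbox = deque()     # messages arriving over the network


def send_on_ring(nodes, src, dst, key, value):
    """Pass a message clockwise around a ring from node src to node dst.

    Returns the number of hops taken. The ring is an assumed topology
    picked for illustration; a mesh would only change the routing step.
    """
    message = {"dst": dst, "key": key, "value": value}
    current, hops = src, 0
    while current != dst:
        current = (current + 1) % len(nodes)   # static link to the next neighbour
        nodes[current].inbox.append(message)
        hops += 1
        if current != dst:
            nodes[current].inbox.popleft()     # intermediate node relays it onward
    # The destination takes the message and stores the data in its own memory.
    delivered = nodes[dst].inbox.popleft() if hops else message
    nodes[dst].local_memory[delivered["key"]] = delivered["value"]
    return hops


nodes = [Node(i) for i in range(4)]
hops = send_on_ring(nodes, src=3, dst=1, key="x", value=42)
print(hops, nodes[1].local_memory)   # 2 {'x': 42}
```

Node 3 never touches node 1's memory directly; the value reaches it only because node 1 itself stores the contents of the message it received.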
Q.10. Explain SIMD Computers?
Ans. SIMD means single instruction stream, multiple data stream. These computers are array processors. There are multiple processing elements supervised by the same control unit. Each processing element receives the same instruction but operates on different data from a distinct memory module.
SIMD Machine Model
An operational model of an SIMD computer is specified by a 5-tuple:
M = (N, C, I, M, R)
where
(1) N is the number of processing elements (PEs) in the machine.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking and other local operations executed by each active PE over data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
(5) R is the set of data routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

Q.11. What are the Architectural development tracks?
Ans. The architecture of today's computers follows development tracks. There are mainly three tracks. These tracks are distinguished by similarity in computational models and technological bases.
1. Multiple Processor tracks: A multiple-processor system can be either a shared memory multiprocessor or a distributed memory multicomputer.
(a) Shared Memory track:
[Fig. Shared memory track: CMU/C.mmp, Illinois Cedar, NYU/Ultracomputer, IBM RP3, BBN Butterfly, Stanford/Dash, Fujitsu VPP500, KSR-1]
This track shows multiprocessor development employing a single address space in the entire system. The C.mmp was a UMA multiprocessor. The C.mmp project pioneered shared memory multiprocessor development, not only in the crossbar architecture but also in multiprocessor operating system development. The Illinois Cedar project and the NYU Ultracomputer project were both developed with a single address space.
Both use multistage networks as the system interconnect. Stanford Dash is a NUMA multiprocessor with distributed memories forming a global address space; cache coherence is maintained with distributed directories. The KSR-1 is a COMA model. The Fujitsu VPP500 is a multiprocessor system with a crossbar interconnect; its shared memories are distributed to all processor nodes.
(b) Message Passing track:
[Fig. Message-passing track: Cosmic Cube, nCUBE-2/6400, Intel iPSC's, Intel Paragon, Mosaic, MIT/J-Machine]
2. Multivector & SIMD tracks
(a) Multivector track
[Fig. Multivector track: CDC 7600, Cray 1, Cray Y-MP, Cray/MPP, CDC Cyber 205, ETA 10, Fujitsu, NEC and Hitachi models]
The CDC 7600 was the first vector dual-processor system. Two subtracks were derived from the CDC 7600. The latest Cray/MPP is a massively parallel system with distributed shared memory.
(b) SIMD track
[Fig. SIMD track: Illiac IV, Goodyear MPP, DAP 610, BSP, CM-5, MasPar MP1, IBM GF/11]
3. Multithreaded and Dataflow tracks: In a multithreaded system, each processor can execute multiple contexts at the same time. Multithreading means that there are multiple threads of control in each processor. Multithreading thus hides long latency in building large-scale multiprocessors. This track has been tried out in laboratories.
[Fig. Multithreaded track: CDC 6600, HEP, Tera, MIT/Alewife]
[Fig. Dataflow track: static dataflow, MIT tagged token, Manchester]

Q.12. What are the two configurations of SIMD array processor?
Ans. A synchronous array of parallel processors is called an array processor; it consists of multiple processing elements (PEs). SIMD array processors have two configurations.
Configuration I (Illiac IV)
The first SIMD array configuration was introduced in the Illiac IV computer. It has N synchronized PEs, all under the control of one CU. Each PE is an arithmetic logic unit (ALU) with attached working registers and a local memory (PEM) for storage. The control unit has its own memory for the storage of programs. User programs are first loaded into the CU memory; the CU then decodes all the instructions and determines where the decoded instructions should be executed. Scalar and control instructions are executed inside the CU, and vector instructions are broadcast to the PEs for distributed execution.
[Figure: Configuration I. I/O, data bus, CU and CU memory; the PEs with their local memories (PEM); interconnection network under CU control]
Configuration II (BSP)
The main differences between configurations I and II lie in two aspects. First, the local memories attached to the PEs are replaced by parallel memory modules shared by all the PEs through an alignment network. Second, the inter-PE network is replaced by the inter-PE memory alignment network, which is controlled by the CU. An example of configuration II is the Burroughs Scientific Processor (BSP). There are N PEs and P memory modules in configuration II. These two numbers (N and P) are not equal and are relatively prime. The alignment network is a path-switching network between the PEs and the parallel memories.
[Figure: Configuration II. I/O, CU and CU memory; the PEs; alignment network; parallel memory modules]
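The way Configuration I splits work between the CU and the PEs can be sketched as follows. This is a toy model under stated assumptions, not a description of the Illiac IV instruction set: the functions `broadcast` and `run`, the tuple format of the program, and the use of one number per PE in place of a full local memory are all hypothetical. It shows scalar instructions running inside the CU, vector instructions being broadcast to every PE, and a mask dividing the PEs into enabled and disabled subsets.

```python
def broadcast(pe_memories, mask, operation):
    """Apply one operation in lock step to the data of every enabled PE.

    pe_memories: one value per PE, standing in for that PE's local memory (PEM).
    mask: one boolean per PE; disabled PEs leave their data unchanged.
    """
    return [operation(value) if enabled else value
            for value, enabled in zip(pe_memories, mask)]


def run(program, pe_memories, mask):
    """Toy control unit: decode each instruction and decide where it executes."""
    cu_register = 0
    for kind, operation in program:
        if kind == "scalar":
            cu_register = operation(cu_register)                    # executed inside the CU
        else:
            pe_memories = broadcast(pe_memories, mask, operation)   # sent to all PEs
    return cu_register, pe_memories


# Hypothetical three-instruction program for a machine with four PEs.
program = [
    ("scalar", lambda r: r + 1),    # e.g. a loop counter kept in the CU
    ("vector", lambda x: x * 10),
    ("vector", lambda x: x + 1),
]
pem = [1, 2, 3, 4]                      # data held by PE0..PE3
mask = [True, False, True, False]       # only PE0 and PE2 are enabled

cu_register, pem_after = run(program, pem, mask)
print(cu_register, pem_after)   # 1 [11, 2, 31, 4]
```

Every enabled PE performs the same operation on its own data in each step, which is the single instruction stream, multiple data stream behaviour; the disabled PEs simply sit out the instruction.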
Chapter 2
Program Partitioning or Scheduling

Q.1. What are program flow mechanisms?
Ans. Conventional computers are based on a control flow mechanism, by which the order of program execution is explicitly stated in the user program. Data flow computers are based on a data driven mechanism and offer a high degree of parallelism at the fine grain instruction level. Reduction computers are based on a demand driven mechanism, which initiates an operation based on the demand for its result by other computations.

Data flow & control flow computers: There are mainly two types of computers. Control flow computers are conventional computers based on the Von Neumann machine; they execute instructions under program flow control. Data flow computers execute instructions according to the availability of data.

Control Flow Computers: Control flow computers use shared memory to hold program instructions and data objects. Variables in shared memory are updated by many instructions. The execution of one instruction may produce side effects on other instructions since memory is shared. In many cases, the side effects prevent parallel processing from taking place. In fact, a uniprocessor computer is inherently sequential due to its use of the control driven mechanism.

Data Flow Computers: In a data flow computer, the execution of an instruction is driven by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever its operands become available. The instructions in a data driven program are not ordered in any way. Instead of being stored in shared memory, data are held directly inside instructions. Computational results are passed directly between instructions. The data generated by an instruction are duplicated into many copies and forwarded directly to all needy instructions. This data driven scheme requires no shared memory, no program counter and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions and to enable the chain reaction of asynchronous instruction execution.

Q.2. Explain data flow architecture?
Ans. There are quite a few experimental data flow computer projects. Arvind and his associates at MIT have developed a tagged token architecture for building data flow computers. The global architecture consists of n processing elements (PEs) interconnected by an n x n routing network. The entire system supports pipelined data flow operations in all n PEs. Inter-PE communications are done through the pipelined routing network.

[Figure: The global architecture: PE1, PE2, ..., PEn connected through the global path to an n x n routing network]

Within each PE, the machine provides a low level token matching mechanism which dispatches only those instructions whose input data are already available. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Instructions are stored in program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. All internal circulation operations are pipelined without blocking.

[Figure: Interior design of a processing element: from routing network, local path, token match, program memory, compute tag, ALU, form token, I-structure]

One can think of the instruction address in a dataflow computer as replacing the program counter, and the context identifier as replacing the frame base register, of a control flow computer. It is the machine's job to match up data with the same tag to needy instructions. In so doing, new data will be produced with a new tag indicating the successor instructions. Thus each instruction represents a synchronization operation. New tokens are formed and circulated along the PE pipeline for reuse, or passed to other PEs through the global path, which is also pipelined.

Q.3. Explain Grain Sizes and Latency.
Ans. Grain size or granularity is a measure of the amount of computation involved in a software process. The simplest measure is to count the number of instructions in a given grain (program segment). Grain size determines the basic program segment chosen for parallel processing. Grain sizes are commonly described as fine, medium or coarse, depending on the processing levels involved.

Latency is a time measure of the communication overhead incurred between machine subsystems. For example, the memory latency is the time required by a processor to access the memory. The time required for two processes to synchronize with each other is called the synchronization latency. Computational granularity and communication latency are closely related.

Q.4. How can we partition a program into parallel branches, program modules, microtasks or grains to yield the shortest possible execution time?
Ans. There exists a tradeoff between parallelism and scheduling overhead. The time complexity involves both computation and communication overheads. Program partitioning involves the algorithm designer, programmer, compiler, operating system support etc.

The idea of grain packing is to apply fine grain first in order to achieve a higher degree of parallelism. Then one combines multiple fine grain nodes into a coarse grain node if doing so eliminates unnecessary communication delays or reduces the overall scheduling overhead. Usually, all fine grain operations within a single coarse grain node are assigned to the same processor for execution. A fine grain partition of a program often demands more interprocessor communication than a coarse grain partition does. Thus grain packing offers a tradeoff between parallelism and scheduling overhead. Internal delays among fine grain operations within the same coarse grain node are negligible, because the communication delay is contributed mainly by interprocessor delays rather than by delays within the same processor. The choice of the optimal grain size is meant to achieve the shortest schedule for the nodes on a parallel computer system.
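The grain packing tradeoff of Q.4 can be illustrated with a small scheduling sketch. It is only a toy model under stated assumptions: the task names, compute times and the communication delay of 6 time units are invented for illustration, a communication delay is paid only when producer and consumer sit on different processors, and each processor runs one task at a time. The function name `schedule_length` is hypothetical, not taken from any text.

```python
def schedule_length(tasks, edges, placement):
    """Finish time of the last task under a simple list schedule.

    tasks:     dict name -> compute time, listed in topological order
    edges:     dict (producer, consumer) -> communication delay
    placement: dict name -> processor id
    Assumption: a delay is paid only across processors.
    """
    finish = {}      # finish time of every scheduled task
    proc_free = {}   # time at which each processor becomes idle
    for name, cost in tasks.items():
        proc = placement[name]
        ready = 0
        for (src, dst), delay in edges.items():
            if dst == name:
                extra = 0 if placement[src] == proc else delay
                ready = max(ready, finish[src] + extra)
        start = max(ready, proc_free.get(proc, 0))
        finish[name] = start + cost
        proc_free[proc] = finish[name]
    return max(finish.values())


# Four fine grain nodes of one time unit each (illustrative values).
tasks = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = {("a", "c"): 6, ("b", "c"): 6, ("c", "d"): 6}

fine = {"a": 0, "b": 1, "c": 2, "d": 3}    # one processor per grain
packed = {"a": 0, "b": 0, "c": 0, "d": 0}  # all grains packed into one node

print(schedule_length(tasks, edges, fine))    # 15
print(schedule_length(tasks, edges, packed))  # 4
```

In this example the fine grain placement exposes the most parallelism but is dominated by interprocessor delays, while the packed placement gives the shorter schedule. With smaller communication delays the comparison can go the other way, which is why the grain size has to be chosen per program and machine.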
Chapter 3
System Interconnect Architecture

Q.1. Explain different network properties?
Ans. The topology of an interconnection network can be either static or dynamic. Static networks are formed of point-to-point direct connections which do not change during program execution. Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demand in user programs.

Static networks are used for fixed connections among the subsystems of a centralized system or among multiple computing nodes of a distributed system. Dynamic networks include buses, crossbar switches and multistage networks, which are often used in shared memory multiprocessors. Both types of network have also been implemented for inter-PE data routing in SIMD computers.

In general, a network is represented by a graph with a finite number of nodes linked by directed or undirected edges. The number of nodes in the graph is called the network size.
Node Degree and Network Diameter: The number of edges incident on a node is called the node degree d. In the case of unidirectional channels, the number of channels into a node is the 'in' degree and the number out of a node is the 'out' degree; the node degree is then the sum of the two. The node degree reflects the number of I/O ports required per node and thus the cost of a node. Therefore, the node degree should be kept constant and as small as possible in order to reduce cost. A constant node degree is very much desired to achieve modularity in building blocks for scalable systems.

The diameter D of a network is the maximum shortest path between any two nodes. The path length is measured by the number of links traversed. The network diameter indicates the maximum number of distinct hops between any two nodes, thus providing a figure of communication merit for the network. Therefore, the network diameter should be as small as possible from a communication point of view.

Q.2. What is Bisection Width?
Ans. When a given network is cut into two equal halves, the minimum number of edges along the cut is called the channel bisection width b. In the case of a communication network, each edge corresponds to a channel with w bit wires. Then the wire bisection width is B = bw. This parameter B reflects the wiring density of a network. When B is fixed, the channel width is w = B/b. Thus the bisection width provides a good indicator of the maximum communication bandwidth along the bisection of a network. All other cross sections should be bounded by the bisection width.

Q.3. What are data routing functions? Describe some data routing functions?
Ans. Data routing networks are used for inter-PE data exchange. A data routing network can be static or dynamic. In a multicomputer network, data routing is achieved by message passing among multiple computer nodes. A routing network reduces the time required for data exchange, and thus system performance is improved.
Commonly used data routing functions are shifting, rotation, permutation, broadcast, multicast, personalised communication, shuffle etc. Some data routing functions are described below:

(a) Permutations: For n objects there are n! permutations by which the n objects can be reordered. The set of all permutations forms a permutation group with respect to the composition operation. Cycle notation is generally used to specify a permutation function. A crossbar can be used to implement any permutation. A multistage network can implement some of the permutations in one or multiple passes through the network. Shifting and broadcast operations are also used to implement permutation operations. The permutation capability of a network is used to indicate its data routing capacity. When n is large, the permutation speed dominates the performance of a data routing network.

(b) Hypercube routing function: A three dimensional cube is shown below.

[Figure: Three dimensional cube]

Routing functions are defined by the three bits in the node address. The bit order is C2 C1 C0. Data can be exchanged between adjacent nodes which differ in the least significant bit C0, as shown below.

[Figure: Routing by least significant bit, C0]
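The bit-wise routing functions of a hypercube can be sketched in a few lines. This is a minimal illustration, assuming node addresses are plain integers whose binary form is C2 C1 C0; the function names `route_by_bit` and `routing_pairs` are hypothetical and chosen only for this example.

```python
def route_by_bit(node, bit):
    """Neighbour reached by complementing one address bit (bit 0 is C0)."""
    return node ^ (1 << bit)


def routing_pairs(dimensions, bit):
    """All node pairs that exchange data under the routing function for `bit`."""
    pairs = []
    for node in range(2 ** dimensions):
        partner = route_by_bit(node, bit)
        if node < partner:  # list every pair only once
            pairs.append((format(node, f"0{dimensions}b"),
                          format(partner, f"0{dimensions}b")))
    return pairs


# Routing by the least significant bit C0 in a 3-cube
print(routing_pairs(3, 0))
# [('000', '001'), ('010', '011'), ('100', '101'), ('110', '111')]
```

Calling `routing_pairs(3, 1)` or `routing_pairs(3, 2)` gives the pairs for the other two address bits, so an n-cube has one such routing function per address bit.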
Similarly, the routing patterns using bits C1 and C2 are shown below:

[Figure: Routing by middle bit, C1]

[Figure: Routing by most significant bit, C2]

In general, an n-dimensional cube has n routing functions, one defined by each bit of the n-bit address. These data routing functions are used in routing messages in a hypercube multicomputer.

(c) Broadcast & Multicast: Broadcast is a one-to-all mapping. It is achieved in SIMD computers using a broadcast bus extending from the array controller to all PEs. A message passing multicomputer also has a mechanism to broadcast messages. Multicast means a mapping from one subset to another. There is a variation of broadcast called personalized broadcast, which sends messages to only selected receivers. Broadcast is a global operation in a multicomputer. Personalized broadcast may have to be implemented with matching of destination codes in the network.

Q.4. What are static interconnection networks?
Ans. Static interconnection networks have many topologies. These topologies are classified according to the dimensions required for layout, for example one dimensional, two dimensional and three dimensional. One dimensional topologies include the linear array, which is used for some pipelined architectures. Two dimensional topologies include the ring, star, mesh and systolic array. Three dimensional topologies include the completely connected network, chordal ring, 3-cube and 3-cube connected cycle network.

One dimensional topology
Linear Array: In a linear array, N nodes are connected by N-1 links in a line. Each internal node has degree 2 and each external (terminal) node has degree 1. The diameter is N-1, which is long for large values of N. The bisection width is b = 1. The linear array is the simplest connection topology. The structure of a linear array is not symmetric, so communication becomes inefficient when N is large.
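The figures quoted for the linear array (degree at most 2, diameter N-1, bisection width 1) can be checked with a short sketch that computes the network properties of Q.1 and Q.2 directly from an edge list. This is only an illustrative brute-force approach, assuming a small connected network with an even number of nodes numbered 0 to N-1; all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def linear_array(n):
    """N nodes connected by N-1 links in a line."""
    return [(i, i + 1) for i in range(n - 1)]


def ring(n):
    """Linear array with the two terminal nodes joined."""
    return [(i, (i + 1) % n) for i in range(n)]


def adjacency(n, edges):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def max_degree(n, edges):
    return max(len(neighbours) for neighbours in adjacency(n, edges).values())


def diameter(n, edges):
    """Maximum shortest path, found by breadth-first search from every node."""
    adj = adjacency(n, edges)
    longest = 0
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest


def bisection_width(n, edges):
    """Minimum number of edges cut over all splits into two equal halves."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        half = set(half)
        cut = sum((u in half) != (v in half) for u, v in edges)
        best = min(best, cut)
    return best


for name, build in (("linear array", linear_array), ("ring", ring)):
    edges = build(8)
    print(name, max_degree(8, edges), diameter(8, edges), bisection_width(8, edges))
# linear array 2 7 1
# ring 2 4 2
```

The exhaustive search over all equal halves is practical only for small N; it is used here because it follows the definition of bisection width literally.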
For small N, say N = 2, it is economical to implement a linear array. As the diameter increases linearly with N, it should not be used for large N. A linear array is different from a bus, which is time shared through switching among the many nodes attached to it. A linear array allows concurrent use of different sections of the structure by different source and destination pairs.

Two dimensional topology
Star: A star is a two level tree with a high node degree of d = N-1 and a small constant diameter of 2. The star topology is used in systems with a centralized supervisor node.

Systolic array: This type of pipelined array architecture is used for implementing fixed algorithms. The systolic array topology is shown in the figure.

[Figures: systolic array; star topology; ring topology]
The systolic array shown is designed for matrix multiplication. The degree of its interior nodes is 6. Systolic arrays are pipelined with multidimensional flow of data streams. With its fixed interconnection and synchronous operation, a systolic array matches the communication structure of the algorithm and suits special applications like signal/image processing. It is difficult to program and thus has limited applications.

Three dimensional topology
3-Cube: A 3-cube having 8 nodes is shown below.

[Figure: 3-cube]

A 4-cube is made by interconnecting the corresponding nodes of two 3-cubes. The node degree of an n-cube is equal to n, and so is the network diameter. The node degree increases linearly with the dimension, which makes the hypercube difficult to use as a scalable architecture.

Cube-Connected Cycles
The cube connected cycles architecture is an improvement on the hypercube. A 3-cube is modified to form the 3-cube connected cycles (CCC). In general, k-cube connected cycles can be constructed from the k-cube with n = 2^k cycles. Each vertex of the k-dimensional hypercube is replaced by a ring of k nodes; thus a k-cube is translated into a k-CCC with k x 2^k nodes.

[Figure: 3-cube connected cycles]

Q.5. Discuss various Dynamic Connection Network.
Ans. For multipurpose and general purpose applications we use dynamic connections. Dynamic connections implement all common communication patterns based on program demands. Instead of fixed connections, switches or arbiters are used along the connecting paths to provide the dynamic connectivity. Basically there are two classes of dynamic interconnection network: single stage and multistage networks.

Single Stage Network: A single stage network is a switching network with N input selectors (IS) and N output selectors (OS). Each input selector is a 1-to-D demultiplexer and each output selector is an M-to-1 multiplexer, where 1 ≤ D ≤ N and 1 ≤ M ≤ N. A single stage network is also called a recirculating network, because data items may have to recirculate through the single stage several times before reaching their destination. The connectivity of the single stage determines the number of recirculations required. A crossbar switching network is a single stage network with D = M = N. To establish a desired path, different control signals are applied to all input selectors and output selectors.

[Figure: Single stage network with input selectors IS 0, IS 1, ..., IS N-1 and output selectors OS 0, OS 1, ..., OS N-1]

Multistage Network: Many stages of interconnected switches form a multistage MIMD network. Multistage networks are characterized by three properties:
(a) Switch box
(b) Network topology
(c) Control structure
A multistage network uses many switch boxes. Each switch box is an interchange device with 2 inputs and 2 outputs. There are four states of a switch box: straight, exchange, upper broadcast and lower broadcast. A two function switch box can be in either the straight or the exchange state; a four function switch box can be in any of the four legitimate states. The switch box and its interconnection states are shown in the figure at the end of this answer.

A multistage network connects an arbitrary input terminal to an arbitrary output terminal. There are two types of multistage network: one sided and two sided. One sided networks are called full switches and have their input and output ports on the same side. Two sided multistage networks have an input side and an output side, and are divided into three classes: blocking, rearrangeable and nonblocking.

[Figure: A two by two switching box, with inputs a0, a1 and outputs b0, b1, and its four interconnection states]
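The four states of a two by two switch box can be written down as a small function. This is a minimal sketch, assuming the usual meaning of the states (upper broadcast copies the upper input a0 to both outputs, lower broadcast copies a1); the function name `switch_box` and the state strings are chosen only for this illustration.

```python
def switch_box(state, a0, a1):
    """Return the outputs (b0, b1) of a 2 x 2 switch box in the given state."""
    if state == "straight":
        return a0, a1
    if state == "exchange":
        return a1, a0
    if state == "upper broadcast":
        return a0, a0
    if state == "lower broadcast":
        return a1, a1
    raise ValueError(f"not a legitimate state: {state}")


for state in ("straight", "exchange", "upper broadcast", "lower broadcast"):
    print(state, switch_box(state, "x", "y"))
# straight ('x', 'y')
# exchange ('y', 'x')
# upper broadcast ('x', 'x')
# lower broadcast ('y', 'y')
```

A two function switch box would accept only the first two states; a four function box accepts all four.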
Chapter 4
Processors and Memory Hierarchy

Q.1. What is difference between RISC and CISC?
Ans.
Properties | CISC | RISC
1. Number of instructions in the instruction set | Varies from 120 to 350 | Below 100
2. Instruction/data format used | Variable instruction and data formats are used | Fixed format instructions/data are used
3. Number of addressing modes | Varies from 12 to 24 | Varies from 3 to 5
4. Number of general purpose registers used | Varies from 8 to 24 | Varies from 32 to 192
5. Number of memory reference instructions used | Large number of memory reference instructions | Fewer memory reference instructions are used; only load and store are memory reference instructions
6. High level language instructions implementation | Directly implemented in hardware | Not directly implemented in hardware
7. Execution efficiency | Execution efficiency increases | Execution efficiency does not increase
8. Control logic used | Microprogrammed control unit is used | Hardwired control unit is used
9. Use of control memory | Control memory is used | No use of control memory
10. Cache memory | Unified cache memory is used | Split cache is used
11. Clock rate | 35-50 MHz | 50-150 MHz

Q.2. What is Super Scalar Processors?
Ans. In a superscalar processor, multiple instruction pipelines are used. This implies that multiple instructions are issued per cycle and multiple results are generated per cycle. In a simple scalar processor, one instruction executes per cycle: only one instruction is issued per cycle and only one completion of an instruction is expected per cycle. Superscalar processors are designed to exploit more instruction level parallelism. Superscalar processors operate basically in parallel.

[Figure: Superscalar architecture: cache memory, fetch unit, decode unit, execution units (EU), register file; multiple instructions are passed from the decode unit to the EUs]

A superscalar processor requires a highly multiported register file, since input ports are required for each EU. Superscalar processors accept a traditional sequential stream of instructions but can issue more than one instruction to the EUs in each cycle. Superscalar processors do not expect dependency free code; they cope with dependencies themselves using hardware. As a result, superscalar processors are considerably more complex than processors with the same degree of parallel execution that expect dependency free code.

Q.3. What is VLIW architecture?
Ans. The VLIW architecture is generalized from two well established concepts:
1. Horizontal microcoding
2. Superscalar processing
VLIW stands for very long instruction word. A typical VLIW machine has instruction words hundreds of bits in length.
VLIW processors expect dependency free code, i.e. multi-operation code. VLIW processors are statically scheduled. The VLIW concept is borrowed from horizontal microcoding. Different fields of the long instruction word carry the opcodes to be dispatched to different functional units.

[Figure: VLIW architecture: main memory, register file, load/store unit, F.P. add unit, integer ALU, branch unit]

[Figure: VLIW processor: cache memory, fetch unit, execution units (EU), register file; a single multi-operation instruction is fetched and dispatched to the EUs]

Q.4. Explain Symbolic Processors?
Ans. Symbolic processing has been applied in many areas, including theorem proving, pattern recognition, expert systems, knowledge engineering, text retrieval, cognitive science and machine intelligence. In these applications, data and knowledge representations, primitive operations, I/O and special architectural features are different from those in numerical computing. Symbolic processors have also been called 'prolog processors' or 'symbolic manipulators'.
Characteristics of Symbolic Processing
Attributes | Characteristics
1. Knowledge representation | Lists, relational databases, scripts, semantic nets, frames, production systems.
2. Common operations | Search, sort, pattern matching, filtering, unification, text retrieval, reasoning.
3. Memory requirements | Large memory with intensive access pattern. Addressing is often content based.
4. Communication pattern | Message traffic varies in size and destination; granularity and format of message units change with applications.
5. Properties of algorithms | Non-deterministic, possibly parallel and distributed computations.
6. Input-output requirements | User-guided programs, intelligent person-machine interfaces, input can be graphical.
7. Architecture features | Parallel update of large knowledge bases, dynamic load balancing, dynamic memory allocation, hardware supported garbage collection.

Q.5. What is Virtual memory? In how many classes virtual memory system is categorized?
Ans. Virtual memory is a concept used in some large computer systems that permits the user to construct programs as though a large memory space were available, equal to the totality of auxiliary memory. In a memory hierarchy system, programs and data are first stored in auxiliary memory. Portions of a program and its data are then brought into main memory as they are needed by the CPU. Each address referenced by the CPU goes through an address mapping from a virtual address to a physical address in memory. Thus virtual memory gives programmers the illusion that they have a large memory at their disposal, even though the computer has a relatively small main memory.

A virtual memory system provides a mechanism for translating program generated addresses into correct main memory locations. This is done dynamically, while processes are executing in main memory. The translation or mapping is handled automatically by the hardware by means of a mapping table. The addresses used by programs are called virtual addresses, and the set of such addresses is called the address space. An address in main memory is called a physical address, and the set of such addresses is called the memory space.

Virtual memory systems are categorized in two classes:
(1) Those with fixed size blocks, called pages.
(2) Those with variable size blocks, called segments.

Paging: Paging is a memory management scheme that permits the physical address space of a process to be noncontiguous.

[Fig. Paging hardware]

[Fig. Segmentation hardware]

Q.6. Explain Backplane Bus System?
Ans. A backplane bus interconnects processors, data storage and peripheral devices in a tightly coupled hardware configuration. The system bus must be designed to allow communication between devices on the bus without disturbing the internal activities of all the devices attached to the bus. Timing protocols must be established to arbitrate among multiple requests. Operational rules must be set to ensure orderly data transfers on the bus.

Signal lines on the backplane are often functionally grouped into several buses, as shown in the figure. The four groups shown here are very similar to those proposed in the 64 bit VME bus specification. Various functional boards are plugged into slots on the backplane. Each slot is provided with one or more connectors for inserting the boards, as demonstrated by the vertical arrows.
For example, one or two 96-pin connectors are used per slot on the VME backplane.

[Figure: Backplane buses, system interfaces and slot connections to various functional boards in a multiprocessor system. Boards: CPU board (processor and cache, functional modules, interface logic); memory board (memory array, functional modules, interface logic); bus controller (system clock driver, daisy chain driver, power driver, bus timer, arbiter, interface logic). Backplane (signal lines and connectors): data transfer bus (DTB), DTB arbitration bus, interrupt and synchronization bus, utility bus.]
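Returning to the address mapping of Q.5, the two classes of virtual memory can be sketched as two small translation functions. This is only an illustration under stated assumptions: the page size of 4096, the table contents and the function names are all hypothetical, the page table is modelled as a dictionary from page number to frame number, and the segment table as a dictionary from segment number to a (limit, base) pair.

```python
PAGE_SIZE = 4096  # hypothetical fixed page size


def translate_paged(virtual_address, page_table):
    """Map a virtual address to a physical address through a page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault on page {page}")
    frame = page_table[page]
    return frame * PAGE_SIZE + offset


def translate_segmented(segment, offset, segment_table):
    """Map (segment, offset) to a physical address, checking the limit."""
    limit, base = segment_table[segment]
    if offset >= limit:
        raise ValueError("offset is beyond the segment limit")
    return base + offset


page_table = {0: 5, 1: 2}          # page number -> frame number
segment_table = {0: (1000, 1400)}  # segment number -> (limit, base)

print(translate_paged(4100, page_table))          # 8196
print(translate_segmented(0, 53, segment_table))  # 1453
```

Pages are fixed in size, so the page number and offset come from a simple division; segments vary in size, so each access is checked against the segment's limit before the base is added.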
Chapter 5
Pipeline Processors

Q.1. What are the characteristics of Pipeline?
Ans. Pipelining refers to the temporal overlapping of processing. Pipelines are nothing more than assembly lines in computing that can be used for instruction processing. A basic pipeline processes a sequence of tasks or instructions according to the following principle of operation. Each task is subdivided into a number of successive subtasks. For example, the processing of a single instruction can be broken down into four subtasks:
1. Instruction Fetch
2. Instruction Decode
3. Execute
4. Write back

It is assumed that there is a pipeline stage associated with each subtask. The same amount of time is available in each stage for performing the required subtask. All the pipeline stages operate like an assembly line, that is, each receives its input from the previous stage and delivers its output to the next stage. We also assume that the basic pipeline is clocked, in other words that it operates synchronously. This means that each stage accepts a new input at the start of a clock cycle, each stage has a single clock cycle available for performing the required operation, and each stage delivers its result to the next stage by the beginning of the subsequent clock cycle.

[Figure: A single task divided into subtasks 1 to 6; a pipeline with input (i/p), Stage 1, Stage 2, Stage 3, ..., Stage n and output (o/p)]
Q.2. Explain Linear Pipeline Processors?
Ans. A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. In modern computers, linear pipelines are applied for instruction execution, arithmetic computation and memory access operations.

A linear pipeline processor is constructed with k processing stages. External inputs are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1 for all i = 1, 2, ..., k-1. The final result emerges from the pipeline at the last stage Sk. Depending on the control of data flow along the pipeline, linear pipelines are modeled in two categories.

Asynchronous Model: Data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.

[Figure: An asynchronous pipeline model: input, stages S1, S2, ..., Sk with Ready/Ack signals between adjacent stages, output]

Synchronous Model: Clocked latches are used to interface between stages. The latches are made with master slave flip flops, which can isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.

[Figure: A synchronous pipeline model: input, latches (L) alternating with stages S1, S2, ..., Sk, a common clock, output]

Q.3. Explain Non linear Pipeline Processors?
Ans. A dynamic pipeline can be reconfigured to perform variable functions at different times. The traditional linear pipelines are static pipelines because they are used to perform fixed functions. A dynamic pipeline allows feed forward and feedback connections in addition to the streamline connections. For this reason, some authors call such a structure a non-linear pipeline.
[Figure: A three stage pipeline]

This pipeline has three stages. Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feed forward connection from S1 to S3 and two feedback connections, from S3 to S2 and from S3 to S1. These feed forward and feedback connections make the scheduling of successive events into the pipeline a nontrivial task. With these connections, the output of the pipeline is not necessarily taken from the last stage. In fact, following different dataflow patterns, one can use the same pipeline to evaluate different functions.

Q.4. What is Reservation Table in linear pipelining?
Ans. The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation table. The table is essentially a space time diagram depicting the precedence relationship in using the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed to flow through the pipeline.

Reservation table of a 4-stage linear pipeline:
Stage | 1 | 2 | 3 | 4
S1    | X |   |   |
S2    |   | X |   |
S3    |   |   | X |
S4    |   |   |   | X

Q.5. What is Reservations table in Non-linear pipelining?
Ans. The reservation table for a dynamic pipeline becomes more complex and interesting because a non-linear pattern is followed. For a given non-linear pipeline configuration, multiple reservation tables can be generated. Each reservation table represents the evaluation of a different function and displays the time space flow of data through the pipeline for one function evaluation. Different functions may follow different paths on the reservation table.
[Figure: Reservation table for function 'X': stages S1, S2, S3 against clock cycles 1 to 8. Processing sequence: S1 S2 S1 S2 S3 S1 S3 S1]

Q.6. What is Instruction Pipeline Design?
Ans. A stream of instructions can be executed by a pipeline in an overlapped manner. A typical instruction execution consists of a sequence of operations, including
(1) Instruction fetch
(2) Decode
(3) Operand fetch
(4) Execute
(5) Write back
phases.

Pipelined instruction processing: A typical instruction pipeline has seven stages, as depicted in the figure. The fetch stage (F) fetches instructions from a cache memory. The decode stage (D) decodes the instruction in order to find the function to be performed and identifies the resources needed. The issue stage (I) reserves resources; resources include GPRs, buses and functional units. The instructions are executed in one or several execute stages (E). The write back stage (WB) is used to write results into the registers. Memory load and store (L/S) operations are treated as part of execution. Floating point add and multiply operations take four execution clock cycles. In many RISC processors fewer cycles are needed.

[Figure: A seven stage instruction pipeline: Fetch (F), Decode (D), Issue (I), Execute (E), Execute (E), Execute (E), Write Back (WB)]
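The overlapped execution in such a pipeline can be visualised with a small space-time sketch. It is an idealised illustration only: it assumes one instruction enters the pipeline per clock cycle and that there are no stalls, so n instructions pass through a k-stage pipeline in k + n - 1 cycles. The stage names follow the seven-stage figure; the function names are hypothetical.

```python
STAGES = ["F", "D", "I", "E", "E", "E", "WB"]  # seven stages as in the figure


def pipeline_cycles(k, n):
    """Clock cycles for n instructions in an ideal k-stage pipeline."""
    return k + n - 1


def space_time_rows(stages, n):
    """One row per instruction; each column is a clock cycle."""
    width = pipeline_cycles(len(stages), n)
    rows = []
    for i in range(n):
        row = ["."] * width
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i occupies stage s in cycle i + s
        rows.append(row)
    return rows


for row in space_time_rows(STAGES, 3):
    print(" ".join(f"{cell:>2}" for cell in row))
```

For three instructions the sketch prints three rows of nine columns, each row shifted one cycle to the right of the previous one. Without pipelining the same three instructions would need 3 x 7 = 21 cycles, so the idealised speedup here is 21 / 9.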
Idle cycles occur when instruction issues are blocked due to resource conflicts, or while an instruction waits for the data Y and Z to be loaded in. For example, the store of the sum to memory location X must wait three cycles for the add to finish, due to flow dependence.

Q.7. Explain Arithmetic Pipeline design?
Ans. Pipelined arithmetic units are usually found in very high speed computers. They are used to implement floating point operations, multiplication of fixed point numbers and similar computations encountered in scientific problems.

Arithmetic Pipeline for Addition & Subtraction: The exponents are compared by subtracting them to determine their difference (segment 1). The larger exponent is chosen, and the exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right (segment 2). This produces an alignment of the two mantissas. The two mantissas are added or subtracted in segment 3. Finally, the result is normalised in segment 4.

[Figure: Arithmetic pipeline for floating point addition and subtraction. Inputs: exponents a, b and mantissas A, B. Segment 1: compare exponents by subtraction. Segment 2: choose exponent, align mantissas. Segment 3: add or subtract mantissas. Segment 4: normalise result, adjust exponent. R: registers between segments.]
When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent is incremented by one. When an underflow occurs, the number of leading zeroes in the mantissa determines the number of left shifts of the mantissa and the number that must be subtracted from the exponent.
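The four segments can be followed step by step in a short sketch. It is a simplified model, not a description of real hardware: it assumes decimal numbers given as (mantissa, exponent) pairs with value = mantissa x 10^exponent, normalised mantissas with 0.1 <= |mantissa| < 1, and exact fractions in place of fixed-width registers (so no digits are lost when shifting). The function name `fp_add` and the sample operands are chosen only for illustration.

```python
from fractions import Fraction


def fp_add(x, y):
    """Add two decimal floating point numbers given as (mantissa, exponent)."""
    (ma, ea), (mb, eb) = x, y

    # Segment 1: compare the exponents by subtraction.
    diff = ea - eb

    # Segment 2: choose the larger exponent and shift the mantissa that
    # belongs to the smaller exponent to the right.
    if diff >= 0:
        exponent = ea
        mb = mb / Fraction(10) ** diff
    else:
        exponent = eb
        ma = ma / Fraction(10) ** (-diff)

    # Segment 3: add the aligned mantissas (a negative mantissa subtracts).
    mantissa = ma + mb

    # Segment 4: normalise the result and adjust the exponent.
    if mantissa == 0:
        return Fraction(0), 0
    while abs(mantissa) >= 1:               # overflow: shift right
        mantissa /= 10
        exponent += 1
    while abs(mantissa) < Fraction(1, 10):  # underflow: shift left
        mantissa *= 10
        exponent -= 1
    return mantissa, exponent


# 0.9504 x 10^3 + 0.8200 x 10^2 = 0.10324 x 10^4
m, e = fp_add((Fraction("0.9504"), 3), (Fraction("0.8200"), 2))
print(float(m), e)  # 0.10324 4
```

In the example the sum of the aligned mantissas is 1.0324, so the overflow rule applies: the mantissa is shifted right once and the exponent is incremented by one. In a pipelined unit each of the four commented steps would be a separate segment working on a different pair of operands in the same clock cycle.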