Polymorphic Architectures I 1 Paper 1: The TRIPS ..., Schemes and Mind Maps of Computer Architecture and Organization

EE392C: Advanced Topics in Computer Architecture. Lecture #7. Polymorphic Processors. Stanford University. Tuesday, April 29 2003.

Typology: Schemes and Mind Maps

2022/2023

Uploaded on 05/11/2023

kaijiang
kaijiang 🇺🇸

4.5

(8)

280 documents

1 / 6

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
EE392C: Advanced Topics in Computer Architecture Lecture #7
Polymorphic Processors
Stanford University Tuesday, April 29 2003
Polymorphic Architectures I
Lecture #7: Tuesday, April 22 2003
Lecturer: Jing Jiang and Honggo Wijaya
Scribe: Chi Ho Yue and Rohit Gupta
We are entering an era of ubiquitous computing, as technology scales, more and more
applications demand ever-growing performances. Yet the design complexity grows as
well. In addition, high non-incurring fabrication cost and manufacturing delays demand
chips to be sold in large volume, thus targeting a larger market to be cost effective. How
can we achieve performances comparable to customized solutions in a single chip design?
The answer is Polymorphous Architecture.
One thing is for sure, interconnect is going to be a big issue and therefore any kind of
architecture needs to be scalable in terms of wires. We will look at 2 particular solutions
today, Smart memories and TRIPS.
1 Paper 1: The TRIPS Multiprocessor
TRIPS processor consists 4 out-of-order, 16-wide-issue Grid processor cores, which can
be partitioned to exploit different types of parallelism. It uses software scheduler to
optimize for point-to-point communication.
It’s a block-oriented system in all modes of operations, namely hyberblocks. Programs
are compiled into large blocks of instructions with single entry point, no internal loops
and possible multiple exit points. Each block has a set of state inputs and a potentially
variable set of state outputs that depend upon the exit point from the block. The compiler
is responsible for statically scheduling each block of instructions onto the computation
engine.
Each node of the grid processor consists of an integer ALU, A floating-point unit, a
set of reservation states. Each node can forward the result to any of the operands in the
local of remote reservation states within the ALU.
TRIPS processor has the following resources to achieve configurability. First frame
space is the reservation stations with the same index across all nodes. Next is the register
file banks, which are used for speculation or multithreading etc, depending on the mode
of operation. Block Sequencing controls has various policies for different modes. For
example, deallocation logic maybe configured to allow a block to execute more than
pf3
pf4
pf5

Partial preview of the text

Download Polymorphic Architectures I 1 Paper 1: The TRIPS ... and more Schemes and Mind Maps Computer Architecture and Organization in PDF only on Docsity!

EE392C: Advanced Topics in Computer Architecture Lecture # Polymorphic Processors Stanford University Tuesday, April 29 2003

Polymorphic Architectures I

Lecture #7: Tuesday, April 22 2003 Lecturer: Jing Jiang and Honggo Wijaya Scribe: Chi Ho Yue and Rohit Gupta

We are entering an era of ubiquitous computing, as technology scales, more and more applications demand ever-growing performances. Yet the design complexity grows as well. In addition, high non-incurring fabrication cost and manufacturing delays demand chips to be sold in large volume, thus targeting a larger market to be cost effective. How can we achieve performances comparable to customized solutions in a single chip design? The answer is Polymorphous Architecture.

One thing is for sure, interconnect is going to be a big issue and therefore any kind of architecture needs to be scalable in terms of wires. We will look at 2 particular solutions today, Smart memories and TRIPS.

1 Paper 1: The TRIPS Multiprocessor

TRIPS processor consists 4 out-of-order, 16-wide-issue Grid processor cores, which can be partitioned to exploit different types of parallelism. It uses software scheduler to optimize for point-to-point communication.

It’s a block-oriented system in all modes of operations, namely hyberblocks. Programs are compiled into large blocks of instructions with single entry point, no internal loops and possible multiple exit points. Each block has a set of state inputs and a potentially variable set of state outputs that depend upon the exit point from the block. The compiler is responsible for statically scheduling each block of instructions onto the computation engine.

Each node of the grid processor consists of an integer ALU, A floating-point unit, a set of reservation states. Each node can forward the result to any of the operands in the local of remote reservation states within the ALU.

TRIPS processor has the following resources to achieve configurability. First frame space is the reservation stations with the same index across all nodes. Next is the register file banks, which are used for speculation or multithreading etc, depending on the mode of operation. Block Sequencing controls has various policies for different modes. For example, deallocation logic maybe configured to allow a block to execute more than

once, as is useful in streaming applications. Also memory tiles can be configured as scratch pad memory, synchronization buffers etc.

The strength of this paper is that the processor can deal with a mix load of parallelism, at least it claims. However, the performance numbers are done with a perfect memory in mind which is not usually the case in a real world. The overhead of speculation hardware can not be underestimated either.

2 Paper 2: Smart Memories

2.1 Summary

This paper proposes Smart Memories as a partitioned, explicitly parallel, reconfigurable architecture for use as a future universal computing element. By using Smart Memories, the appearance of the on-chip memory, interconnection network and processing elements can be tailored to better match the application requirements.

Smart Memories contains an array of processor tiles and on-die DRAM memories connected by a packet-based, dynamically routed network. In order to get more com- putation power than what is contained in a single processing tile, four processor tiles are clustered together into a ”quad” and a low-overhead, intra-quad interconnection net- work is provided. By grouping the tiles into quads also makes the global interconnection network more efficient by reducing the number of global network interfaces.

A Smart Memories tile consists of a reconfigurable memory system, a crossbar inter- connection network, a processor core and a quad network interface. Having a reconfig- urable memory system is important since different applications have different memory access patterns. The crossbar interconnection is used to connect memory mats to proces- sors or the quad interface port. The processor itself contains integer and floating point clusters, local register files and shared FP register file to provide the necessary band- width. Each tile can sustain up to two independent threads. Smart Memories also allow for reconfigurable instruction format and decode.

2.2 Results

Smart Memories mapped really well to two different machines on far ends of the ar- chitectural spectrum which require very different memory systems and arrangement of compute resources. The first machine is Imagine, a highly-tuned SIMD/vector machine optimized for media applications with large amount of data parallelism. The second one is Hydra, a speculative multiprocessors that supports application with irregular accesses and communication patterns.

P

$

P

$

Memory

Figure 1: Processor/Cache/Memory Configurations

3.2 Smart Memories

How does one support different combinations of the above processor/cache/memory or- ganizations? Is that necessary?

Depending on each application, the ideal configuration is different. It would ideal to support all possible formations. But beyond a certain point, it becomes a problem of diminishing return with the extra topologies only providing minimal performance gains

As an example, we discuss how SMART memories would have to be adopted to handle a second level of coherence caches. Currently, L2 cache is present on a per-quad basis. How would this change if L2 is shared across multiple quads? In a re-configurable scheme, we need some sort of mapping for every access scenario since the processors must know how to handle memory requests. This limits the complexity and number of mappings that can be supported.

In order to support distributed L2 cache, the inter-quad connections need to provide the required bandwidth for inter-processor connection. It also needs to support some kind of coherency protocol. An ”easy” way out would be to use message passing to explicitly share data without worrying coherency. One way to implement coherency is to use a directory based approach. Every processor can search the directory to locate the required page. Generally specking, the directory size is proportional to the main memory size. In order to avoid each quad have a large directory that can take up significant portion of the L2 cache, we can implement some sort of distributed directory scheme. Every quad would be assigned a range of addresses for which that quad is responsible to store directory information. By this scheme, every processor knows where to send its memory requests and the system can avoid duplicating pages. An alternate to distributed directory would be to store the directory in a memory access controller at the memory interface. This could potentially simplify the solution at the cost of increasing latency.

3.3 FPGA

Perhaps the FPGA can be viewed as the ”ultimate” polymorphic architecture since most FPGAs are bit-re-configurable. However, FPGAs today are moving toward a more block- oriented approach with shorter programming times at the cost of reduced flexibility. Amongst the major challenges faced by the FPGA is having a good programming model and associated compilers that can efficiently utilize its resources.

3.4 Compiler

Compilers used in multithreaded environments require the capability to extract hyper blocks and perform some (hopefully aggressive) amount of speculation. A hyper block is defined as a chunk of code that is executed sequentially and there is only one point of entry into that particular block. There may be multiple exit points. The advantage to this scheme to this scheme is that we only have to figure the dependencies (data and control) within a single hyper block. Further, this should reduce I cache misses since a large number of instruction is already known. Ideally, we want to make hyper blocks large, but it is hard to find such contiguous sections of code in normal programs. One method to make commonly executed streams of code into a large block, and the rare cases where execution comes in mid-way through the block has to be handled by replicating relevant sections of code. Again, this is a tradeoff between code-size and potential performance gains.

In order to efficiently utilize DLP, we need parallelizing compilers that go beyond the usual 4-way (or larger) VLIW design and can parallelize arbitrary loops and tasks. An important consideration for any compiler design is obviously the programming model. For example, depending on programming model, one can use message passing or memory- based synchronization. This will affect the compiler’s design and its abilities.

The programming model is highly dependent on the kind of applications running. Furthermore, any given application of medium complexity may have different sections that exhibit very different behavior. For example, a visual or speech recognition ap- plication will have a DSP module and recognition module. A DSP algorithm typically exhibits a lot of DLP, whereas the recognition may be done using table-lookup matching type schemes which involve random memory accesses. Efficiently programming these components would typically use drastically different programming models

Ideally, the compiler should be able to figure out all types of parallelism. However, compilers today are not smart enough to do that. Even optimized compilers are depen- dent on good programming semantics to extract ILP for VLIW processors even though these compilers have been used for decades. Needless to say, we need a good programming model that will explicitly expose TLP and DLP. Popular high-level languages such as C do not do that. Further, a question that needs to be addressed is what level of familiarity