



EE392C: Advanced Topics in Computer Architecture. Lecture #8. Polymorphic Architectures. Stanford University. Thursday, April 24, 2003.




Lecture #8: Thursday, April 24, 2003
Lecturer: Joel Coburn, John Kim
Scribe: David Bloom, Amin Firoozshahian
Continuing the study of reconfigurable architectures, this lecture covers two architectures with finer-grained reconfiguration: PipeRench [1] and the MIT RAW processor [2]. The RAW processor consists of sixteen single-issue cores connected by a low-latency on-chip communication network. PipeRench is a reconfigurable array of processing elements (PEs) with programmable interconnection networks between them, targeted mostly at exploiting parallelism in data-parallel applications. Compared to the Smart Memories [3] and TRIPS [4] architectures, these two have finer basic blocks (PEs in PipeRench and single-issue cores in RAW), which expose more of the architecture to the compiler and demand more static scheduling of the code.
PipeRench is a co-processor for streaming multimedia applications. It differs from the other polymorphic architectures in that it is an attached processor. It is also finer-grained than the others and resembles an FPGA with a coarser granularity. The granularity chosen is 8-bit processing elements, as this width provides the best tradeoff between data path utilization and complexity. Since most data elements in streaming multimedia are 8 to 16 bits wide, a smaller granularity makes more sense than a conventional 32-bit data path.
The architecture consists of a network of configurable logic and storage elements within each processing element (PE). A row of PEs forms a "stripe", and stripes are stacked on top of each other with local interconnects between them to create the configurable fabric. Each stripe also represents a pipeline stage of the hardware.
A unique feature of this architecture is that it is a "pipelined reconfigurable architecture". Even though the number of physical stripes on silicon might be limited, through the process of virtualizing hardware, the number of virtual stripes can exceed the number of physical stripes. This virtualization brings several benefits:
The compiler is isolated from the hardware, since it does not need to know how many physical stripes there are.
The performance obtainable from a given number of physical stripes is much greater (e.g. with 16 physical stripes and 128 virtual stripes, you can get 8x the performance with just 16 stripes). However, virtualization does bring a complication: it restricts the model of computation to a pipelined data path, where each pipeline stage corresponds to a stripe.
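The mapping of virtual stripes onto physical stripes can be illustrated with a small simulation. This is only a sketch under a simplifying assumption that is not taken from the notes: exactly one physical stripe is reconfigured per cycle, in round-robin order. The function name `stripe_schedule` and its parameters are hypothetical.

```python
def stripe_schedule(num_physical, num_virtual, num_cycles):
    """Return, per cycle, which virtual stripe is resident in each physical stripe.

    Assumption of this sketch: one physical stripe is reconfigured per cycle,
    round-robin, and virtual stripes are loaded in order, wrapping around.
    None marks a physical stripe that has not been configured yet.
    """
    resident = [None] * num_physical
    schedule = []
    for cycle in range(num_cycles):
        # Overwrite the oldest configuration with the next virtual stripe.
        resident[cycle % num_physical] = cycle % num_virtual
        schedule.append(list(resident))
    return schedule


if __name__ == "__main__":
    # 5 virtual pipeline stages time-multiplexed onto 3 physical stripes.
    for cycle, row in enumerate(stripe_schedule(3, 5, 8)):
        print(f"cycle {cycle}: {row}")
```

With 3 physical and 5 virtual stripes, the fourth configuration step overwrites physical stripe 0 with virtual stripe 3, which is how the fabric presents more pipeline stages than it physically contains.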
As with most other polymorphic architectures, obtaining high performance depends critically on the compiler. For PipeRench, compilation uses a dataflow intermediate language (DIL), a single-assignment language with C operators.
The results in the paper show a possible 10x-200x speedup on various kernels, but they do not account for the I/O limitations of the architecture. On one particular application (IDEA), the architecture obtained a 10x performance increase over both a general-purpose processor and custom hardware.
The advantage of this architecture is that it addresses several weaknesses of FPGAs, namely in forward compatibility, rapid reconfiguration, and compilability. It is also probably easier to implement in VLSI, since only a single type of PE and the interconnections need to be designed. However, the architecture has several limitations, mainly related to the limited bandwidth between PipeRench and the main processor and main memory. As a result, very few applications are suitable for this architecture.
The RAW paper [2] operates on the premise that future architectures will have distributed resources in order to meet the demands of greater parallelism and faster clock rates. Distributed resources imply non-uniform access latencies, so instruction scheduling becomes both a spatial and a temporal problem. The authors propose a compiler, RAWCC, for general-purpose sequential programs on the RAW machine. Through space-time scheduling, the compiler can exploit ILP within basic blocks.
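To make the idea of scheduling in both space and time concrete, the following is a generic greedy list-scheduling sketch. It is not RAWCC's actual algorithm. The assumptions are all illustrative: a 4x4 mesh of tiles, unit instruction latency, and a communication cost of one cycle per Manhattan hop. The names `space_time_schedule`, `deps`, and `hop_latency` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def manhattan(a, b):
    """Hop count between two tiles on a mesh (assumed routing cost)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def space_time_schedule(deps, rows=4, cols=4, hop_latency=1):
    """Assign each instruction a tile (space) and a start cycle (time).

    deps maps an instruction name to an iterable of its predecessors.
    Greedy rule: visit instructions in dependency order and pick the tile
    that allows the earliest start, given operand arrival times and tile
    availability. Every instruction is assumed to take one cycle.
    """
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    tile_free = {tile: 0 for tile in tiles}
    placement = {}
    for instr in TopologicalSorter(deps).static_order():
        best = None
        for tile in tiles:
            # Operands arrive after the producer finishes plus network hops.
            ready = max(
                (placement[p][1] + 1 + hop_latency * manhattan(placement[p][0], tile)
                 for p in deps.get(instr, ())),
                default=0,
            )
            start = max(ready, tile_free[tile])
            if best is None or start < best[1]:
                best = (tile, start)
        placement[instr] = best
        tile_free[best[0]] = best[1] + 1
    return placement


if __name__ == "__main__":
    # c = a + b, where a and b are independent and can run in parallel.
    example = {"a": (), "b": (), "c": ("a", "b")}
    for name, (tile, start) in sorted(space_time_schedule(example).items()):
        print(f"{name}: tile {tile}, cycle {start}")
```

In the small example, the two independent instructions land on neighbouring tiles in the same cycle, and the dependent instruction is delayed by one extra cycle for the operand that must cross the network. This is the spatial cost a purely temporal scheduler would not see.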
As in the previous lecture, the first point of discussion was the granularity of reconfiguration. PipeRench clearly has finer-grained configuration. This imposes more hardware overhead when implementing functions more complicated than a single PE provides, and it also makes the compiler's job more difficult: instead of simply using a functional unit, the compiler must also synthesize it out of PEs and the interconnection network between them. In other words, the compiler is now responsible for virtualizing the target architecture as well as using it.
On the other hand, compared to FPGAs, having ALUs in the form of PEs and dedicated registers at each PE reduces the hardware overhead and greatly speeds up configuration. This observation led to questions about common algorithms for placement and routing in FPGAs and how they can be accelerated. One suggestion was to look up predefined configurations in a table and load one into the FPGA whenever such a configuration is encountered. Another suggestion for programming an array of reconfigurable elements was to map simple VLIW cores onto each row and, instead of having specific architectural registers, pass results directly between them.
One main question raised during this part of the discussion was the difference between configuration and programming. Two main definitions given were:
It was noted that the difference between the two is often unclear and that they are in fact often confused. For instance, the configuration described in the PipeRench paper actually changes on a cycle-by-cycle basis, so it should probably more correctly be referred to as programming.
As mentioned before, in PipeRench the functionality of each pipe stage changes every cycle. This adds to the complexity of the compiler, since it has to decide all these changes statically at compile time and know which stage will perform which part of the functionality before the code runs on the fabric. In systems like Smart Memories, on the other hand, the distinction between compiling the program and configuring the hardware is easier to draw.
Another issue raised was exception handling in architectures like PipeRench or RAW, or, more fundamentally, whether precise exceptions need to be supported at all. The difficulty is that spreading execution cores over the chip and requiring explicit communication between them loses the original order of execution. The advantage of such architectures is that, since they ignore virtual memory issues, precise exceptions are not explicitly required.
Also, the authors of the PipeRench paper did not mention anything about memory and its hierarchy. This matters because feeding all the processing elements with data requires enough memory bandwidth, which can potentially be a bottleneck. While some architectures, such as Imagine, emphasize this bandwidth and its hierarchy, the PipeRench architecture does not mention it at all.
Returning to the discussion of exceptions, we note that precise exceptions are currently supported at the instruction level. An interesting experiment would be to change the level of preciseness, e.g. supporting precise exceptions at basic block boundaries. To support such a scheme, a couple of requirements come to mind at first glance:
Similar to the granularity discussion about the various configurations, it was noted that with coarser granularity, implementations may be simpler (in this case, precise exception handling is more feasible at the block level than at the instruction level, since the whole block can be replayed in case of an exception).
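The replay idea can be sketched in a few lines. This is an illustration only, under assumptions that are not from the notes: architectural state is a dictionary of registers, a block is a list of functions that mutate that state, and a block's effects are buffered in a copy and committed only if the whole block completes. The names `run_block`, `run_program`, and the sample operations are hypothetical.

```python
import copy


def run_block(state, block):
    """Execute a block atomically: commit all of its effects or none."""
    shadow = copy.deepcopy(state)  # checkpoint taken at block entry
    for op in block:
        op(shadow)                 # may raise; 'state' is still untouched
    state.clear()
    state.update(shadow)           # commit at the block boundary


def run_program(state, blocks, handler):
    """Run blocks in order; exceptions are precise at block boundaries."""
    for index, block in enumerate(blocks):
        try:
            run_block(state, block)
        except Exception as exc:
            # The handler sees the state exactly as it was at block entry.
            handler(state, index, exc)
            run_block(state, block)  # replay the whole block once


def inc_r3(s):
    s["r3"] += 1


def div_r1_by_r2(s):
    s["r1"] //= s["r2"]


def fix_divisor(state, index, exc):
    state["r2"] = 2  # stand-in for whatever a real handler would repair


if __name__ == "__main__":
    regs = {"r1": 10, "r2": 0, "r3": 0}
    run_program(regs, [[inc_r3, div_r1_by_r2]], fix_divisor)
    print(regs)
```

Because the partial increment of `r3` is discarded when the division faults, replaying the block does not apply it twice. In hardware, the equivalent buffering of register and memory updates until the end of the block is where the cost of such a scheme would lie.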
In discussing the RAW architecture, one interesting observation was that RAW acts somewhat as a dual of TRIPS. While TRIPS tries to turn thread- and data-level parallelism into instruction-level parallelism by executing them in frames, RAW supports data- and instruction-level parallelism by explicitly distributing computation to different processing cores and running it as different threads. This requires fast communication between processing cores, which is provided by the low-latency on-chip network. The compiler puts all statically predictable communication on the static on-chip network, and ordering never changes in this network. The difficulty is then adapting this method of exploiting parallelism (converting DLP and ILP to TLP) to applications with more dynamic code. In fact, the authors do not evaluate ILP programs to see how well this scheme handles applications with considerable data-dependent branches or pointer-chasing sections.
As a side discussion, the class brought up decoupled architectures. The main idea behind decoupled architectures is to separate the program's instruction stream into two semi-independent streams, a load/store stream and an arithmetic stream, and to let the first run ahead so that data is brought in before the computation actually needs it. As Figure 1 shows, communication between the two instruction streams is done via queues. This leads to an out-of-order issue machine that uses queues instead of register renaming. This