CSE 237A

Spring 2004 Project Report

Dynamic Power Aware Packet Processing with CMP

Zhen Ma and Weifeng Zhang
Department of Computer Science & Engineering
University of California, San Diego
{zhma,weifeng}@cs.ucsd.edu

Abstract

Network processors implemented as systems-on-chip with multiple processors and peripherals offer a reliable means of scaling networks to high link capacities. As more and more co-processors and peripherals are integrated, the power requirement also increases dramatically. It is therefore essential to parallelize the subsystems efficiently, maximizing packet processing capacity while keeping power consumption low.

In this project, we propose a power aware packet processing architecture with a chip-multiprocessor (CMP), which consists of a number of processor clusters (or arrays). Each array includes a number of identical processor cores, while cores in different arrays differ in performance and power consumption. Only one array of processors is active at any time. We devise a simple policy to select a proper array of processors to lower the power consumption while still meeting the QoS requirements. Our simulation results show that the proposed CMP model has an approximately 40% power reduction compared to the CMP without power management, and an 11% power improvement compared to the symmetric CMP approach.

1. Introduction

Emerging network applications, such as VoIP, IPv4/IPv6 gateways, software routers, VPN, and intrusion detection, stimulate the design of new packet processing systems that require both high packet processing capacity and programmable flexibility. Traditional ASICs lack programmable flexibility, and general-purpose processors cannot deal with Gigabit link capacities. Network processors emerge as the solution by integrating multiple processors on chip and exploiting the aggressive parallelism inherent in the network workload. However, the trend to integrate more and more functionality on the same silicon die, as well as the continuous exponential growth of link capacities, makes the power consumption of network processors a challenging issue.

Figure1 shows the trend of the power requirements as the processor functionality increases in the Intel IXP series. With dense integration and high performance requirements, it is certainly a design challenge to meet the tight system power budget.

Figure1. Power requirements for IXP series (from [8])

Network processors are usually designed to target the maximum traffic load. However, internet traffic analysis reveals that packet traffic fluctuates significantly over time. The common traffic load is often substantially lower than the expected peak load, and the traffic shows bursty behavior on all time scales [3, 13, 18]. Beyond internet traffic, there are various other network traffic patterns. For example, a LAN or a home network router may have low traffic most of the time, with only sparse bursts.

With the emergence of a rich set of network applications, such as voice over IP and security applications, network processor performance is also limited by memory latencies. Processors may spend a large portion of their time waiting for data from lower levels of the memory hierarchy.

The above opportunities have been explored at the circuit level to achieve power efficient packet processing. There is also a large space to explore at the architectural and OS levels.

In this project, we propose a novel CMP model for power aware packet processing. The CMP architecture consists of a number of processor clusters (arrays), each of which has a number of identical processor cores. The processor cores in different arrays are geared for different levels of performance and power consumption. At any time, only one array of cores is active, and arrays are activated or deactivated based on the traffic load and other constraints (such as throughput and latency). We devise a simple switching policy that uses the low performance/power array when the traffic is light and the high performance/power array when the traffic is heavy. The high performance array is capable of handling the peak or bursty traffic load, while the low performance array handles the non-peak load to reduce power consumption.

This paper is organized as follows: section 2 briefly describes the existing power reduction techniques. Section 3 elaborates the proposed CMP model for power aware

Our CMP architecture extends Kumar’s approach. Instead of a sequence of processor cores with different power/performance, we have variable power/performance clusters, and each cluster includes identical cores. This extended architecture aims to provide high processing capacity to handle peak network traffic loads, while allowing power saving when the traffic is low.

3 Power Aware CMP Model

There are two existing CMP models for packet processing. The first model has a control core with a number of high performance microengines. A typical example is the Intel IXP2800 network processor. The processor contains a 700 MHz XScale core and sixteen 1.4 GHz high performance microengines which support both dynamic voltage scaling (DVS) and dynamic power management (DPM). However, to the best of our knowledge, there is no existing implementation of DVS or DPM on the IXP series yet. The second model is a theoretical model provided in [8], which contains a number of identical processors. This symmetric CMP dynamically changes the number of running processors according to the packet load to reduce power consumption.

In this paper, we present a third type of power aware CMP model to process network packets, as illustrated in Figure2. For simplicity, we assume it contains two arrays of processor cores. Each array includes two homogeneous processor cores, but the two arrays differ in performance and power consumption. For example, the high performance array can be two Alpha 21464 cores (250 Watt, die size 350 mm^2), and the low performance array can be two Alpha 21064 cores (5 Watt, die size 2.87 mm^2). Only one array of cores is active at any time. The processors in the active array monitor their own processing latency, response time, throughput, and the packet buffer queue utilization. When the workload becomes light (or heavy), the active array transfers control to the low (or high) performance array, which in turn shuts down the previously active array. Because the two kinds of CMP cores consume different amounts of power, this dynamic peer-to-peer processor migration should reduce the power consumption while still sustaining the QoS requirements.

Figure2. the power aware CMP model

The incoming packet stream is buffered in the packet queue (FIFO). A low packet queue utilization event triggers the system to switch to the low performance array. Likewise, a high queue utilization event triggers a switch to the high performance array. To ensure that the packet queue does not overflow (which would drop packets), it is essential to choose a proper utilization threshold for triggering the switching event.
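The threshold rule described above can be sketched as follows. This is a minimal illustration, not the simulator's code: the report does not give the threshold values it used, so `struct switch_policy`, `select_array`, and the two marks are hypothetical names and parameters. Keeping the current array between the two marks is also an assumption, added here to avoid switching back and forth on small fluctuations.

```c
#include <assert.h>
#include <stddef.h>

/* Which processor array is, or should become, active. */
enum cmp_array { CMP_ARRAY_LOW = 0, CMP_ARRAY_HIGH = 1 };

/* Hypothetical policy parameters, expressed in queue entries.
 * The report does not state its thresholds; these are placeholders. */
struct switch_policy {
    size_t capacity;   /* total queue size in entries */
    size_t low_mark;   /* at or below this occupancy: light load */
    size_t high_mark;  /* at or above this occupancy: heavy load */
};

/* Return the array that should be active for the given queue occupancy. */
enum cmp_array select_array(const struct switch_policy *p,
                            enum cmp_array active, size_t occupancy)
{
    assert(p->low_mark < p->high_mark && p->high_mark <= p->capacity);

    if (occupancy >= p->high_mark)
        return CMP_ARRAY_HIGH;   /* high utilization event */
    if (occupancy <= p->low_mark)
        return CMP_ARRAY_LOW;    /* low utilization event */
    return active;               /* between the marks: no switching event */
}
```

The high mark has to leave enough free entries to absorb the packets that arrive while the old array drains and the new one starts up; otherwise the queue can overflow before the faster array takes over.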

We devise a state machine to control the transition between different arrays. The processor cores in an array are in the same state. Figure3 shows the state machine for our proposed CMP model. There are three states in the simulator.

Figure3. State machine for the proposed power aware CMP model

  • Running: the packet processing application reads packets from the queue and processes them. The array transitions to the draining state when the queue threshold event occurs.
  • Draining: the cores stop fetching instructions and continue executing the instructions already in the instruction queue. If the processor pipeline is in error recovery (for example, after a branch misprediction), the error recovery takes priority. The length of the draining state largely depends on the execution latency of the in-flight instructions. Meanwhile, the new array can be woken up to reduce the startup latency. When the pipeline is empty, the array transitions to the switching state.
  • Switching: activate the new array, shut down the old array, and resume fetching and execution from where the old array left off.
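The three states and their transitions can be sketched as a small step function. This is an illustrative sketch only: `struct cmp_ctrl`, `cmp_step`, and the event flags are hypothetical names, and the actual simulator is built into SimpleScalar rather than structured this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cmp_state { CMP_RUNNING, CMP_DRAINING, CMP_SWITCHING };

/* Hypothetical controller record; all cores of an array share one state. */
struct cmp_ctrl {
    enum cmp_state state;
    int active_array;   /* array currently executing */
    int target_array;   /* array to hand over to */
};

/* Advance the controller by one step.
 * threshold_event: a queue threshold was crossed this step
 * requested:       array the threshold event asks for
 * pipeline_empty:  all in-flight instructions have completed */
void cmp_step(struct cmp_ctrl *c, bool threshold_event,
              int requested, bool pipeline_empty)
{
    assert(c != NULL);

    switch (c->state) {
    case CMP_RUNNING:
        if (threshold_event && requested != c->active_array) {
            c->target_array = requested;   /* stop fetching from now on */
            c->state = CMP_DRAINING;
        }
        break;
    case CMP_DRAINING:
        if (pipeline_empty)                /* pipeline has drained */
            c->state = CMP_SWITCHING;
        break;
    case CMP_SWITCHING:
        c->active_array = c->target_array; /* new array takes over */
        c->state = CMP_RUNNING;
        break;
    }
}
```

Threshold events that arrive while draining or switching are ignored in this sketch, so a transition in progress always completes before another can start.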

Initially, the system starts up with the high performance array in the running state, and it is expected to stay in the low performance array most of the time. If the packet queue underflows (becomes empty), the packet processing application receives an empty packet. The application could pause to reduce execution power, or the low performance array could be clock-gated to reduce power even further. We don’t simulate clock gating in this project; instead, we let the application drop the empty packet and keep reading for new packets.

4 Simulation Implementation

We modify SimpleScalar 3.0 [2] to model our multi-array CMP architecture, and we model the power consumption using Wattch [1]. The different cores are modeled with different cache sizes, fetch/execution bandwidths, and die sizes.

4.1 Packet queue manager

In order to simulate packet processing, we also simulate the functions of the hardware queue manager. The queue manager checks the packet time stamp to emulate the true

Figure4. realistic traffic pattern (x-axis: time in milliseconds; y-axis: # of packets)

The goal of the synthetic trace is to simulate traffic patterns for network applications other than Internet traffic. For example, a home network router sees more low traffic periods during workdays, while the users are out at work, but a much higher traffic load during evenings and weekends.

Figure5. synthetic traffic pattern (x-axis: time in milliseconds; y-axis: # of packets)

5 Power/Performance Evaluation

In this section, we describe the detailed core configuration and the simulation results for both the realistic and synthetic traces. We define metrics for performance comparison and analyze the obtained power savings.

5.1 Core configuration and power/performance difference

The core configuration is shown in Table1. The high performance array has two identical fast cores, while the low performance array has two identical slow cores. The two types of cores differ in L1 cache size, bandwidth, frequency (the frequencies could also be set differently for more power savings), die size (the low performance array adds only a very small portion of area and cost), and power scaling factors (from Wattch). We then use Wattch to evaluate the power consumption of these two types of cores. We find that the power consumption of the fast cores is about 4 times that of the slow cores.

We also run the SPEC2000 benchmark on SimpleScalar 3.0 for each of the two types of cores, and find that in terms of performance, the fast cores are approximately 30% faster than the slow ones (Table2).

              Fast Cores    Slow Cores
L1 Cache      4k            16k
Bandwidth     4             8
Frequency     733MHz        733MHz
Die size      18 mm         1mm
Power         217 Watt      54 Watt

Table1. CMP configurations

Bzip2         0.8139        0.
crafty        0.9734        0.
Eon           1.0987        0.
Gap           1.1379        0.
Gzip          1.9602        1.
Mcf           0.04          0.
parser        0.5976        0.
Perl          2.2856        1.
twolf         0.5821        0.
vortex        1.058         0.
Vpr           0.594         0.
average       1.012855      0.

Table2. performance of SPEC2000 benchmark

5.2 Performance

We run the CMP simulator to process 100,000 packets from both the realistic traffic trace and the synthetic traffic trace in each of three modes (slow cores only, fast cores only, and switching-enabled). We evaluate the performance using the following metrics: queue_utilization (the queue size is 512 in our simulation); latency (the time between a packet being enqueued and dequeued); and processing cycles (the time to process a packet).
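As a rough illustration of how these three metrics could be computed from per-packet timestamps, consider the sketch below. The record layout and function names are hypothetical, and it assumes that all times are measured in simulator cycles and that processing runs from dequeue to completion; the report does not spell out either detail.

```c
#include <assert.h>

#define QUEUE_SIZE 512u   /* queue size used in the simulation */

/* Hypothetical per-packet timestamps, in simulator cycles. */
struct packet_times {
    unsigned long enqueue_cycle;  /* packet entered the queue */
    unsigned long dequeue_cycle;  /* packet was read by a core */
    unsigned long done_cycle;     /* core finished processing it */
};

/* Fraction of the queue that is occupied, in the range [0, 1]. */
double queue_utilization(unsigned occupancy)
{
    assert(occupancy <= QUEUE_SIZE);
    return (double)occupancy / QUEUE_SIZE;
}

/* Latency: time the packet spent waiting in the queue. */
unsigned long packet_latency(const struct packet_times *t)
{
    assert(t->dequeue_cycle >= t->enqueue_cycle);
    return t->dequeue_cycle - t->enqueue_cycle;
}

/* Processing cycles: time a core spent on the packet. */
unsigned long processing_cycles(const struct packet_times *t)
{
    assert(t->done_cycle >= t->dequeue_cycle);
    return t->done_cycle - t->dequeue_cycle;
}
```

For example, a queue holding 256 packets has a utilization of 0.5, and a packet enqueued at cycle 100, dequeued at cycle 180, and completed at cycle 480 has a latency of 80 cycles and 300 processing cycles.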

                     Normalized total energy consumption    Total cycles
Fast-cores           100                                    3.58x
Slow-cores           55.0                                   8.01x10^8
Switching-enabled    56.1                                   0.79x10^8 fast (13.8%), 4.96x10^8 low (86.2%), 251 switching

Table 6. power consumption and processing time for synthetic traffic

We find 40.3% energy savings for the realistic traffic and 43.9% energy savings for the synthetic traffic relative to the fast-cores only approach. The energy consumption is very close to that of the slow-cores only approach (especially for the synthetic traffic), without suffering any packet drops. The system spends about 73% of its execution time in the slow array for the realistic traffic, and 86% for the synthetic traffic. We believe our CMP model has strong potential to handle sparse, spiky traffic loads efficiently.
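The synthetic-traffic figures follow directly from the Table 6 entries, as the short sketch below shows. The helper names are hypothetical; the inputs are the normalized energy values (100 for fast-cores, 56.1 for switching-enabled) and the switching-enabled cycle counts (0.79x10^8 fast, 4.96x10^8 low) reported in the table.

```c
#include <assert.h>

/* Percentage energy saving of a mode relative to a baseline,
 * where both values use the same normalization. */
double energy_saving_pct(double baseline, double mode)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - mode) / baseline;
}

/* Percentage of the total cycles that was spent in one array. */
double cycle_share_pct(double array_cycles, double total_cycles)
{
    assert(total_cycles > 0.0 && array_cycles <= total_cycles);
    return 100.0 * array_cycles / total_cycles;
}
```

With the Table 6 values, `energy_saving_pct(100.0, 56.1)` gives about 43.9, and `cycle_share_pct(4.96e8, 0.79e8 + 4.96e8)` gives about 86.3, matching the reported 43.9% saving and the roughly 86% of execution time spent in the slow array.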

We also performed a preliminary evaluation comparing our CMP model with the symmetric CMP approach, and found an 11% power reduction relative to the symmetric CMP approach.

6 Conclusion

In this project, we propose a power aware packet processing architecture with CMP. This CMP architecture with single-ISA multi-cores provides an alternative network processor model for power conservation. By simulation, we show that for both realistic and synthetic traffic traces, our proposed CMP model can achieve approximately 40% energy savings relative to network processor architectures without power management. We believe even larger savings are possible for traffic patterns with more bursty behavior.

To make our approach more energy efficient, we need to add dynamic online prediction of the traffic load, and thereby design a more efficient switching policy and a more precise adaptive queue utilization threshold. We may also combine the inter-array core switching with intra-array core shutdown for more energy savings.

References

  1. D. Brooks, V. Tiwari, M. Martonosi, Wattch: A Framework for Architectural-level Power Analysis and Optimizations, in Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 83-94.
  2. D. Burger, T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report CS-TR-97-1342, University of Wisconsin, June 1997.
  3. A. Erramilli, O. Narayan and W. Willinger. Experimental queueing analysis with long-range dependent packet traffic. IEEE/ACM Transactions on Networking, 4(2):209-223,
  4. M.A. Franklin, T. Wolf, Power Considerations in Network Processor Design. In Proceedings of Second Network Processor Workshop in conjunction with Ninth International Symposium on High Performance Computer Architecture (HPCA-9), pages 10–22, 2003.
  5. Intel IXP2400 Network Processor, http://www.intel.com/design/network/products/npfamily/ixp2400.htm
  6. S. Irani, S.K. Shukla, R.K. Gupta, Algorithms for power savings, in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 2003
  7. S. Kaxiras, G. Keramidas, IPStash: A Power-Efficient Memory Architecture for IP Lookup, in 36th International Symposium on Microarchitecture, December, 2003.
  8. R. Kokku, U.B. Shevade, N.S. Shah, M. Dahlin, H.M. Vin, Energy Aware Packet Processing, University of Texas at Austin Technical Report # TR04-04.
  9. R. Kumar, K. Farkas, N.P. Jouppi, P. Ranganathan, D.M. Tullsen, Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction, in 36th International Symposium on Microarchitecture, December, 2003.
  10. M. Nikitovic and M. Brorsson, An adaptive chip-multiprocessor architecture for future mobile terminals, in Proceedings of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'02), Oct. 2002.
  11. NLANR Network Traffic Packet Header Traces. http://pma.nlanr.net/Traces/.
  12. P. Pillai and K. G. Shin. Real-time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems. In Proceedings of the eighteenth ACM Symposium on Operating Systems Principles, pages 89–102. ACM Press, 2001.
  13. Y. Qiao, J. Skicewicz, and P. Dinda. Multiscale Predictability of Network Traffic. Technical report.
  14. RIPE Network Coordination Center. http://www.ripe.net

The CMP APIs:

void CMP_queue_read(struct regs_t *regs, mem_access_fn mem_fn, struct mem_t *mem, md_inst_t inst, int traceable);
void CMP_queue_write();
void CMP_queue_init(struct CMP_state *state);
void CMP_queue_finish(md_inst_t inst);

int CMP_queue_fulling();
int CMP_queue_draining();

void CMP_switch_processor( int processor );
void CMP_power_scale( int processor );

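The primitive implementation:

/* Queue-read primitive. On Alpha, $31 always reads as zero, so the addq
   copies the buffer address into $1. The bis writes to $31 and is therefore
   architecturally a no-op; the immediate (1) presumably marks it as the
   queue-read instruction for the simulator. Both instructions are kept in
   one asm statement, with $1 declared as clobbered, so the compiler cannot
   reuse $1 between them. */
void read_primitive(CMP_queue *buffer)
{
    asm volatile ("addq $31, %0, $1\n\t"
                  "bis $31, 1, $31"
                  : : "r"(buffer) : "$1", "memory");
}

/* Finish primitive: the same no-op encoding with immediate 2. */
void finish_primitive()
{
    asm volatile ("bis $31, 2, $31");
}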