







A project report on power aware packet processing using a chip multiprocessor (CMP) architecture. The report proposes a power management policy to select a proper array of processors to lower power consumption while maintaining QoS requirements. The CMP model is shown to have approximately 40% power reduction compared to CMP without power management and an 11% power improvement compared to the symmetric CMP approach.








Spring 2004 Project Report
Zhen Ma and Weifeng Zhang Department of Computer Science & Engineering University of California, San Diego {zhma,weifeng}@cs.ucsd.edu
Network processors implemented as systems-on-chip with multiple processors and peripherals offer a reliable means of scaling networks to high link capacities. As more and more co-processors and peripherals are integrated, the power requirement also increases dramatically. It is therefore essential to parallelize the subsystems efficiently, maximizing the packet processing capacity while keeping power consumption low.
In this project, we propose a power aware packet processing architecture based on a chip-multiprocessor (CMP), which consists of a number of processor clusters (or arrays). Each array includes a number of identical processor cores, and the processor cores in different arrays have different performance and power consumption. Only one array of processors is active at any time. We devise a simple policy to select a proper array of processors to lower the power consumption while still meeting the QoS requirements. Our simulation results show that the proposed CMP model has an approximately 40% power reduction compared to the CMP without power management, and an 11% power improvement compared to the symmetric CMP approach.
Emerging network applications, such as VoIP, IPv4/IPv6 gateways, software routers, VPN, and intrusion detection, stimulate the design of new packet processing systems, which require both high packet processing capacity and programmable flexibility. Traditional ASICs lack programmable flexibility, and general-purpose processors cannot deal with Gigabit link capacities. Network processors emerge as the solution by integrating multiprocessors on chip and exploiting the aggressive parallelism inherent in the network workload. However, the trend to integrate more and more functionality on the same silicon die, as well as the continuous exponential growth of link capacities, makes the power consumption of network processors a challenging issue.
Figure1 shows the trend of the power requirements as the processor functionality increases in the Intel IXP series. With dense integration and high performance requirements, it is certainly a design challenge to meet the tight system power budget.
Figure1. Power requirements for IXP series (from [8])
Network processors are usually designed to target the maximum traffic load. However, Internet traffic analysis reveals that packet traffic fluctuates significantly over time. The common traffic load is often substantially lower than the expected peak load, and the traffic shows bursty behavior on all time scales [3, 13, 18]. In addition to the Internet traffic load, there are various other network traffic patterns. For example, a LAN or a home network router may have low traffic most of the time, with only sparse spikes of bursty behavior.
With the emergence of a rich set of network applications, such as voice over IP and security applications, network processor performance is also limited by memory latencies. Processors may spend a large portion of their time waiting for data from the lower levels of the memory hierarchy.
The above opportunities have been explored at the circuit level to achieve power efficient packet processing. There is also a large space to explore at the architectural and OS levels.
In this project, we propose a novel CMP model for power aware packet processing. The CMP architecture consists of a number of processor clusters (arrays), each of which has a number of identical processor cores. The processor cores in different arrays are geared for different levels of performance and power requirement. At any time, only one array of cores is active, and a processor array is activated or deactivated based on the traffic load and other constraints (such as throughput and latency). We devise a simple switching policy that uses the low performance/power array of cores when the traffic is low and the high performance/power array when the traffic is heavy. The high performance array is capable of handling the peak/bursty traffic load, while the low performance array handles the non-peak load to reduce the power consumption.
This paper is organized as follows: section 2 briefly describes the existing power reduction techniques. Section 3 elaborates the proposed CMP model for power aware
Our CMP architecture extends Kumar’s approach. Instead of a sequence of processor cores with different power/performance, we have variable power/performance clusters, and each cluster includes identical cores. This extended architecture aims to provide high processing capacity to handle peak network traffic loads, while allowing power saving when the traffic is low.
There are two existing CMP models for packet processing. The first model has a control core with a number of high performance microengines. A typical example is the Intel IXP2800 network processor. The processor contains a 700 MHz XScale core and sixteen 1.4GHz high performance microengines which support both DVS and DPM. However, to the best of our knowledge, there is no existing implementation for DVS or DPM on IXP series yet. The second model is a theoretical model provided in [8], which contains a number of identical processors. This symmetric CMP dynamically changes the number of running processors according to the packet load to reduce power consumption.
In this paper, we present a third type of power aware CMP model to process network packets, as illustrated in Figure2. For simplicity, we assume it contains two arrays of processor cores. Each array includes two homogeneous processor cores, but the two arrays have different performance and power consumption. For example, the high performance array can be two Alpha 21464 cores (250 Watt, die size 350mm^2), and the low performance array can be two Alpha 21064 cores (5 Watt, die size 2.87mm^2). Only one array of cores is active at any time. The processors in the active array monitor their own processing latency, response time, throughput, and the packet buffer queue utilization. When the workload becomes light (or heavy), the active array transfers control to the low (or high) performance array, which in turn shuts down the previously active array. Because of the different power consumption of the CMP cores, the dynamic peer-to-peer processor migration should reduce the power consumption while still sustaining the QoS requirements.
Figure2. the power aware CMP model
The incoming packet stream is buffered in the packet queue (FIFO). When a low packet queue utilization event occurs, it triggers the system to switch to the low performance array. Likewise, a high queue utilization event switches the system to the high performance array. To ensure that the packet queue does not overflow (which would cause packets to be dropped), choosing a proper utilization threshold to trigger the switching event is essential.
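The threshold-triggered switching described above can be sketched as follows. This is a minimal illustration, not the simulator's code: the names select_array, HIGH_WATERMARK and LOW_WATERMARK and the two threshold values are assumptions chosen for the example, while the queue capacity of 512 is the value reported in Section 5.2. Using two separate thresholds is one possible way to avoid switching back and forth when the utilization hovers around a single value.

```c
#define QUEUE_CAPACITY 512   /* queue size used in the simulation (Section 5.2) */
#define HIGH_WATERMARK 0.75  /* hypothetical: switch to the fast array at or above this */
#define LOW_WATERMARK  0.25  /* hypothetical: switch to the slow array at or below this */

typedef enum { ARRAY_LOW = 0, ARRAY_HIGH = 1 } cmp_array_t;

/* Return the array that should be active, given the currently active
 * array and the number of packets waiting in the FIFO. */
cmp_array_t select_array(cmp_array_t active, int queued_packets)
{
    double utilization = (double)queued_packets / QUEUE_CAPACITY;

    if (active == ARRAY_LOW && utilization >= HIGH_WATERMARK)
        return ARRAY_HIGH;   /* queue is filling: hand over to the fast cores */
    if (active == ARRAY_HIGH && utilization <= LOW_WATERMARK)
        return ARRAY_LOW;    /* queue is draining: hand over to the slow cores */
    return active;           /* between the thresholds: no switch */
}
```

With these example thresholds, an occupancy of 400 packets while the slow array is active requests the fast array, and an occupancy of 100 packets while the fast array is active requests the slow array; anything in between leaves the active array unchanged.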
We devise a state machine to control the transition between different arrays. The processor cores in an array are in the same state. Figure3 shows the state machine for our proposed CMP model. There are three states in the simulator.
Figure3. State machine for the proposed power aware CMP model
Initially, the system starts up with the high performance array in the running state, and it is expected to stay in the low performance array most of the time. If the packet queue underflows (becomes empty), the packet processing application receives an empty packet. The application could pause to reduce execution power, or the low performance array could be clock-gated to reduce power even further. We do not simulate clock gating in this project; instead, we let the application drop the empty packet and keep reading for new packets.
We modify SimpleScalar 3.0 [2] to model our multi-array CMP architecture, and we model the power consumption using Wattch [1]. The different cores are modeled with different cache sizes, fetch/execution bandwidths, and die sizes.
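One way such a state machine could look is sketched below. The text does not name the three states, so the states (running on the high performance array, running on the low performance array, and a transient switching state) and the event names are assumptions made purely for illustration. The only detail taken from the text is that the system starts in the running state on the high performance array.

```c
/* Hypothetical states and events; the report's Figure3 is not reproduced here. */
typedef enum { STATE_RUN_HIGH, STATE_RUN_LOW, STATE_SWITCHING } cmp_state_t;
typedef enum { EV_QUEUE_LOW, EV_QUEUE_HIGH, EV_SWITCH_DONE } cmp_event_t;

typedef struct {
    cmp_state_t state;   /* current state shared by all cores of the array */
    cmp_state_t target;  /* running state entered once a switch completes */
} cmp_fsm_t;

/* The system starts with the high performance array running. */
void cmp_fsm_init(cmp_fsm_t *fsm)
{
    fsm->state  = STATE_RUN_HIGH;
    fsm->target = STATE_RUN_HIGH;
}

/* Advance the state machine by one event; irrelevant events are ignored. */
void cmp_fsm_step(cmp_fsm_t *fsm, cmp_event_t ev)
{
    switch (fsm->state) {
    case STATE_RUN_HIGH:
        if (ev == EV_QUEUE_LOW) {
            fsm->target = STATE_RUN_LOW;
            fsm->state  = STATE_SWITCHING;
        }
        break;
    case STATE_RUN_LOW:
        if (ev == EV_QUEUE_HIGH) {
            fsm->target = STATE_RUN_HIGH;
            fsm->state  = STATE_SWITCHING;
        }
        break;
    case STATE_SWITCHING:
        if (ev == EV_SWITCH_DONE)
            fsm->state = fsm->target;
        break;
    }
}
```

In this sketch, queue events that arrive while a switch is in progress are simply ignored, which keeps the transitions easy to reason about; a real controller might instead queue them.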
4.1 Packet queue manager
In order to simulate packet processing, we also simulate the functions of the hardware queue manager. The queue manager checks the packet time stamp to emulate the true
[Plot: # of packets (y-axis, 0 to 80) against time in milliseconds (x-axis, 1 to 601)]
Figure4. realistic traffic pattern
The goal of the synthetic trace is to simulate traffic patterns for network applications other than Internet traffic. For example, a home network router sees more low traffic periods during workdays, while the users are out at work, but a much higher traffic load during evenings and weekends.
[Plot: # of packets (y-axis, 0 to 70) against time in milliseconds (x-axis, 1 to 813)]
Figure5. synthetic traffic pattern
In this section, we describe the detailed core configuration and the simulation results for both the realistic and synthetic traces. We define performance metrics for performance comparison and analyze the obtained power savings.
5.1 Core configuration and power/performance difference
The core configuration is shown in Table1. The high performance array has two identical fast cores, while the low performance array has two identical slow cores. The two core types differ in L1 cache size, bandwidth, die size (the low performance array adds only a very small portion of area and cost), and power scaling factors (from Wattch). The frequency is the same for both core types in our configuration, although it could also differ for more power savings. We then use Wattch to evaluate the power consumption of these two types of cores. We find that the power consumption of the fast cores is about 4 times that of the slow cores.
We also run the SPEC2000 benchmark on SimpleScalar 3.0 for each of the two types of cores, and find that in terms of performance, the fast cores are approximately 30% faster than the slow ones (Table2).
             Fast Cores    Slow Cores
L1 Cache     4k            16k
Bandwidth    4             8
Frequency    733MHz        733MHz
Die size     18 mm         1mm
Power        217 Watt      54 Watt
Table1. CMP configurations

Bzip2        0.8139        0.
crafty       0.9734        0.
Eon          1.0987        0.
Gap          1.1379        0.
Gzip         1.9602        1.
Mcf          0.04          0.
parser       0.5976        0.
Perl         2.2856        1.
twolf        0.5821        0.
vortex       1.058         0.
Vpr          0.594         0.
average      1.012855      0.
Table2. performance of SPEC2000 benchmark
5.2 Performance
We run the CMP simulator to process 100,000 packets from both the realistic traffic trace and the synthetic traffic trace in each of three modes (slow cores only, fast cores only, and switching-enabled). We evaluate the performance with the following metrics: queue_utilization (the queue size is 512 in our simulation); latency (the time between a packet being enqueued and dequeued); and processing cycles (the time to process a packet).
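The three metrics can be expressed as simple calculations over per-packet cycle stamps. The sketch below is illustrative only: the record layout and the function names are assumptions, not the simulator's data structures, and it assumes that processing of a packet begins when it is dequeued. Only the queue size of 512 comes from the text.

```c
#include <stdint.h>

#define QUEUE_SIZE 512  /* queue size used in the simulation */

/* Hypothetical per-packet record of simulator cycle stamps. */
typedef struct {
    uint64_t enqueue_cycle;  /* cycle at which the packet entered the queue */
    uint64_t dequeue_cycle;  /* cycle at which a core took it from the queue */
    uint64_t finish_cycle;   /* cycle at which processing completed */
} packet_record_t;

/* Fraction of the queue that is occupied, in the range 0.0 to 1.0. */
double queue_utilization(int queued_packets)
{
    return (double)queued_packets / QUEUE_SIZE;
}

/* Latency: time between a packet being enqueued and dequeued. */
uint64_t packet_latency(const packet_record_t *p)
{
    return p->dequeue_cycle - p->enqueue_cycle;
}

/* Processing cycles: time spent processing the packet after dequeue. */
uint64_t processing_cycles(const packet_record_t *p)
{
    return p->finish_cycle - p->dequeue_cycle;
}
```

For a packet enqueued at cycle 1000, dequeued at cycle 1400 and finished at cycle 2300, this gives a latency of 400 cycles and 900 processing cycles.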
                     Normalized total energy consumption    Total cycles
Fast-cores           100                                    3.58x fast
Slow-cores           55.0                                   8.01x10^8
Switching-enabled    56.1                                   0.79x10^8 fast (13.8%), 4.96x10^8 low (86.2%), 251 switching
Table 6. power consumption and processing time for synthetic traffic
We find 40.3% energy savings for the realistic traffic and 43.9% energy savings for the synthetic traffic relative to the fast-cores only approach. The energy consumption is very close to that of the slow-cores only approach (especially for the synthetic traffic), without suffering any packet drops. The system spends about 73% of its execution time in the slow array for the realistic traffic, and 86% for the synthetic traffic. We believe our CMP model has strong potential to handle sparse, spiky traffic loads efficiently.
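The synthetic-traffic figures can be checked directly against the values in Table 6. The helper names below are hypothetical and serve only to make the arithmetic explicit: the savings follow from the normalized energy (100 for fast-cores only versus 56.1 with switching enabled), and the time share follows from the cycle counts of the two arrays.

```c
/* Percentage of energy saved relative to a baseline. */
double energy_savings_percent(double baseline, double measured)
{
    return 100.0 * (baseline - measured) / baseline;
}

/* Percentage of the combined cycle count contributed by `part`. */
double cycle_share_percent(double part, double other)
{
    return 100.0 * part / (part + other);
}
```

Calling energy_savings_percent(100.0, 56.1) reproduces the 43.9% savings reported for the synthetic traffic, and cycle_share_percent(4.96e8, 0.79e8) gives roughly 86%, in line with the share of time spent in the slow array (the small difference from the tabulated 86.2% comes from the rounded cycle counts).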
We also did a preliminary evaluation to compare our CMP model with the symmetric CMP approach, and we find an 11% power reduction relative to the symmetric CMP approach.
In this project, we propose a power aware packet processing architecture with CMP. This CMP architecture with single-ISA multi-cores provides an alternative network processor model for power conservation. By simulation, we show that for both realistic and synthetic traffic traces, our proposed CMP model can achieve approximately 40% energy savings relative to network processor architectures without power management. We believe even larger savings are possible for traffic patterns with more bursty behavior.
To make our approach more energy efficient, we need to add dynamic online prediction of the traffic load, and use it to design a more efficient switching policy and a more precise adaptive queue utilization threshold. We may also combine the inter-array core switching with intra-array core shutdown for more energy savings.
The CMP APIs:
void CMP_queue_read(struct regs_t *regs, mem_access_fn mem_fn, struct mem_t *mem, md_inst_t inst, int traceable);
void CMP_queue_write();
void CMP_queue_init(struct CMP_state *state);
void CMP_queue_finish(md_inst_t inst);

int CMP_queue_fulling();
int CMP_queue_draining();

void CMP_switch_processor( int processor );
void CMP_power_scale( int processor );

The primitive implementation:
void volatile read_primitive(CMP_queue *buffer) {
    int tmp;

    /* $31 is the Alpha zero register, so this copies the buffer address into $1 */
    asm volatile ("addq $31, %0, $1"::"r"(buffer) );
    /* the result is written to $31 and discarded, so this is architecturally a no-op */
    asm volatile ("bis $31, 1, $31");
}

void volatile finish_primitive() {
    /* same no-op form with a different immediate operand */
    asm volatile ("bis $31, 2, $31");
}