Boosting Superscalar Performance with Store Sets: Memory Dependence Prediction, Papers of Computer Architecture and Organization

This document describes an implementation of store sets for memory dependence prediction in superscalar processors. The authors, Daniel Chen and George H. Huang of the University of Illinois, Urbana-Champaign, implement store sets, a technique proposed by Chrysos and Emer, to address the memory dependence problem in out-of-order processors. A store set is the set of stores that a load instruction has previously depended on. Using store sets, a processor can predict when a load may be issued with reduced risk of memory violations, which increases the throughput of memory instructions. The document also evaluates the store set system and its impact on performance.

Typology: Papers

Pre 2010

Uploaded on 03/10/2009

12/2006
Implementation of Store Sets for Memory Dependence Prediction
Daniel Chen and George H. Huang
ECE511: Computer Architecture
University of Illinois, Urbana-Champaign, IL

Abstract

Issuing instructions out of order in a superscalar processor introduces a number of problems, among them register and memory dependencies. A memory dependence occurs when a load reads from the same address as an earlier store. The value the load reads depends on the store, so the load must execute after the store. One way to increase the throughput of memory instructions is memory dependence prediction using “store sets.” A store set is the set of stores that a load instruction has previously depended on. A processor can discover store sets and use them to predict when a load may be issued with reduced risk of memory violations. We implemented a store set system based on the paper by Chrysos and Emer to investigate the benefit of store sets and confirmed that memory dependence prediction with store sets significantly reduces the number of memory violations.

1. Introduction

While register renaming is the widely accepted solution for register dependencies in out-of-order processors, there is no such standard for memory dependencies, in part because load and store addresses are not known until the instructions execute. There are two basic approaches to issuing instructions with memory dependencies: no speculation and naïve speculation.

A no speculation approach waits until all earlier store instructions have issued before a load instruction is allowed to issue.

Naïve speculation issues memory instructions as they arrive, without waiting for earlier stores. When a store executes, it must check whether any later load that depends on it (that is, reads the same address) has already executed. If so, a memory violation has occurred, and execution must restart from the misspeculated load.

The problem with no speculation is that, while it eliminates memory violations, it performs poorly, because loads that do not depend on any earlier store are delayed as well. Naïve speculation, on the other hand, performs better but wastes time recovering from memory violations.
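To make the two baseline policies concrete, here is a minimal Python sketch. It assumes a toy model in which each memory operation is only a kind and an address, listed in program order, and the execution order is given as a list of indices. The names `MemOp`, `naive_violations`, and `no_speculation_can_issue` are hypothetical, not taken from the ECE511 simulator, and the sketch only detects violations; it does not model the restart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str  # "load" or "store"
    addr: int


def naive_violations(program, exec_order):
    """Naive speculation: return (store, load) index pairs flagged as violations.

    program lists operations in program order; exec_order lists indices into
    program in the order they execute. This toy model ignores the case where
    an intervening store to the same address already supplied the value.
    """
    executed = set()
    violations = []
    for i in exec_order:
        op = program[i]
        if op.kind == "store":
            # When a store executes, look for younger loads to the same
            # address that have already executed.
            for j in sorted(executed):
                if j > i and program[j].kind == "load" and program[j].addr == op.addr:
                    violations.append((i, j))
        executed.add(i)
    return violations


def no_speculation_can_issue(program, load_idx, issued):
    """No speculation: a load may issue only after every older store has issued."""
    return all(k in issued for k in range(load_idx) if program[k].kind == "store")


prog = [MemOp("store", 0x10), MemOp("load", 0x20), MemOp("load", 0x10)]
print(naive_violations(prog, [1, 2, 0]))  # the store executes last
```

With the store executing last, the load at index 2 (same address) is flagged and the call returns `[(0, 2)]`; under no speculation that load would simply have waited for index 0 to issue.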

The goal of store sets is to use the history of previous memory violations to predict memory dependencies, so that loads can be issued aggressively while loads that do have dependencies are delayed only long enough to avoid memory violations.

2. Store Sets

A store set is the set of all the stores that a particular load has ever depended on. The processor does not know this set beforehand; it must discover it.

Every time a memory violation occurs, the store that caused the violation is added to the dependent load’s store set. The next time the load is fetched, the scheduler makes sure that all of the stores in the load’s store set have issued before it allows the load to issue.

In the following short program, each load’s store set is shown:

Example 2.1:

PC   Instruction   Store set
0    Store A
4    Store B
8    Store C
12   Store A
16   Load A        [0, 12]
20   Load B        [4]
24   Load C        [8]
28   Load D        [empty]
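The complete store sets in Example 2.1 can be derived mechanically: for each load, collect every earlier store to the same address. The following Python sketch assumes straight-line code and symbolic addresses; the helper `complete_store_sets` is a hypothetical name used only for illustration.

```python
def complete_store_sets(program):
    """For each load PC, collect the PCs of all earlier stores to the same address.

    program: list of (pc, kind, addr) tuples in program order.
    """
    sets = {}
    for i, (pc, kind, addr) in enumerate(program):
        if kind == "load":
            sets[pc] = {spc for spc, k, a in program[:i] if k == "store" and a == addr}
    return sets


example_2_1 = [
    (0, "store", "A"), (4, "store", "B"), (8, "store", "C"), (12, "store", "A"),
    (16, "load", "A"), (20, "load", "B"), (24, "load", "C"), (28, "load", "D"),
]

for pc, stores in complete_store_sets(example_2_1).items():
    print(pc, sorted(stores))
```

Running it prints `16 [0, 12]`, `20 [4]`, `24 [8]`, and `28 []`, matching Example 2.1. In a real program, loops and changing addresses mean a load’s set is built up over the dynamic execution rather than read off a static listing.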

Note that the example shows complete store sets. In an actual implementation, a store is added to a load’s store set only when it causes a memory violation with that load.
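The learned (rather than complete) version can be sketched as an unbounded table keyed by load PC, combined with the scheduler rule from the start of this section. This is the conceptual structure that the next section calls impractical, not the SSIT/LFST design; the class `IdealStoreSets` and its methods are hypothetical.

```python
from collections import defaultdict


class IdealStoreSets:
    """Unbounded store-set table: load PC -> set of store PCs (conceptual only)."""

    def __init__(self):
        self.sets = defaultdict(set)

    def record_violation(self, load_pc, store_pc):
        # A store joins a load's store set only after it causes a violation.
        self.sets[load_pc].add(store_pc)

    def can_issue(self, load_pc, unissued_older_store_pcs):
        # The load must wait while any older, not-yet-issued store belongs
        # to its store set.
        learned = self.sets.get(load_pc, set())
        return learned.isdisjoint(unissued_older_store_pcs)


ss = IdealStoreSets()
print(ss.can_issue(16, [0, 12]))  # nothing learned yet
ss.record_violation(16, 12)       # the store at PC 12 caused a violation
print(ss.can_issue(16, [0, 12]))  # now the load waits for store 12
```

The first call prints `True` and the second `False`. Store 0 is never learned here, so the load would still speculate past it until it, too, causes a violation.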

3. Implementation

Our implementation of store sets is modeled after the paper by Chrysos and Emer [1]. Conceptually, a store set structure would be infinitely large and hold a store set for every load executed; building such a structure is not practical.

The idea is simplified by using a Store Set ID Table (SSIT) and a Last Fetched Store Table (LFST). The SSIT is indexed by a hash of the PC of a load or store instruction. The SSIT is filled with Store Set IDs (SSIDs); an SSID identifies a particular store set. The SSID indexes into the LFST, which holds the inum of the last fetched store in that store set. Because only this one store inum is tracked per set, each memory instruction needs to record at most one dependence, which greatly simplifies the dependence chain.

Figure 3.1: Store Set Implementation. Loads and stores index into the SSIT to obtain their SSID. The SSID indexes into the LFST to obtain the inum of the last store they should depend on.
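A minimal Python sketch of the SSIT/LFST lookup path follows. The PC hash `(pc >> 2) % ssit_size`, the table sizes, and the use of `None` for invalid entries are illustrative assumptions. Invalidating an LFST entry when its store issues is an assumption modeled on the Chrysos and Emer design; the class and method names are hypothetical.

```python
class StoreSetPredictor:
    """SSIT + LFST lookup path (sizes and hash are illustrative assumptions)."""

    def __init__(self, ssit_size=1024, lfst_size=128):
        self.ssit = [None] * ssit_size  # PC hash -> SSID, None = invalid
        self.lfst = [None] * lfst_size  # SSID -> inum of last fetched store

    def index(self, pc):
        # Hypothetical hash: drop the low 2 bits of a word-aligned PC.
        return (pc >> 2) % len(self.ssit)

    def on_fetch(self, pc, is_store, inum):
        """Return the inum of the store this instruction must wait for, or None."""
        ssid = self.ssit[self.index(pc)]
        if ssid is None:
            return None  # not in any store set: no predicted dependence
        dep = self.lfst[ssid]
        if is_store:
            # This store becomes the last fetched store of its set, so the
            # next load or store in the set will wait on it.
            self.lfst[ssid] = inum
        return dep

    def on_store_issue(self, pc, inum):
        # Invalidate the LFST entry if this store is still the last one fetched.
        ssid = self.ssit[self.index(pc)]
        if ssid is not None and self.lfst[ssid] == inum:
            self.lfst[ssid] = None


p = StoreSetPredictor()
p.ssit[p.index(0)] = p.ssit[p.index(16)] = 3  # store @0 and load @16 share SSID 3
print(p.on_fetch(0, True, 100))    # first store in the set
print(p.on_fetch(16, False, 101))  # load waits on store inum 100
```

This prints `None` and then `100`. Because a fetched store reads the LFST before overwriting it, stores in the same set end up chained one after another, while a load only ever waits on the single most recent store of its set.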

When a memory violation occurs, SSIDs are assigned according to four rules defined by Chrysos and Emer:

  1. If neither the load nor the store has been assigned an SSID, a new one is allocated and assigned to both.
  2. If only the load has been assigned an SSID, the store is assigned the load’s SSID.
  3. If only the store has been assigned an SSID, the load is assigned the store’s SSID.
  4. If the load and store have different SSIDs, an arbiter assigns one instruction’s SSID to the other. The arbiter must always choose the same winner for a given pair of SSIDs, so that the members of one store set gradually migrate into the other. This is called store set merging.
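The four rules map directly onto a small function. In this sketch, the arbiter in rule 4 is assumed to pick the numerically smaller SSID, which is one possible deterministic choice; `assign_on_violation` and its `allocate` callback are hypothetical names.

```python
import itertools


def assign_on_violation(ssit, load_idx, store_idx, allocate):
    """Apply the four SSID assignment rules after a load/store violation.

    ssit: mutable list mapping SSIT index -> SSID or None.
    allocate: callable returning a fresh SSID (hypothetical allocator).
    """
    load_ssid, store_ssid = ssit[load_idx], ssit[store_idx]
    if load_ssid is None and store_ssid is None:
        # Rule 1: allocate a new store set for both instructions.
        ssit[load_idx] = ssit[store_idx] = allocate()
    elif store_ssid is None:
        # Rule 2: only the load has an SSID; the store joins its set.
        ssit[store_idx] = load_ssid
    elif load_ssid is None:
        # Rule 3: only the store has an SSID; the load joins its set.
        ssit[load_idx] = store_ssid
    elif load_ssid != store_ssid:
        # Rule 4: merge. Assumed deterministic arbiter: the smaller SSID wins.
        winner = min(load_ssid, store_ssid)
        ssit[load_idx] = ssit[store_idx] = winner


ssit = [None] * 8
fresh = itertools.count()
assign_on_violation(ssit, 1, 2, lambda: next(fresh))  # rule 1
ssit[5] = 7
assign_on_violation(ssit, 5, 2, lambda: next(fresh))  # rule 4
print(ssit[1], ssit[2], ssit[5])
```

After both calls, entries 1, 2, and 5 all hold SSID 0, so the printed line is `0 0 0`. Whatever the arbiter, it only needs to be consistent so that repeated merges converge on one set.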

3.1 Simulator

The simulator available to us was the ECE511 simulator. It models a 1-wide, out-of-order issue pipeline with a ghist-type (global history) branch predictor and load/store queues, and its issue buffer and reorder buffer sizes are variable. It takes the naïve approach to memory speculation. We built the SSIT and LFST inside the scheduler stage. In-flight instructions are uniquely identified by the timestamp already provided in the instruction data structure. We added two fields to the instruction data structure: one records whether the instruction depends on another instruction, and the other holds that instruction’s timestamp.
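The two added fields might look like the following Python dataclass. The field names `has_mem_dep` and `dep_timestamp`, the trimmed-down `Inst` record, and the helper `tag_dependence` are all hypothetical; the report does not give the simulator’s actual names or layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inst:
    timestamp: int  # already unique per in-flight instruction
    pc: int
    is_mem: bool
    # Hypothetical names for the two fields added for store sets:
    has_mem_dep: bool = False
    dep_timestamp: Optional[int] = None


def tag_dependence(inst: Inst, dep_timestamp: Optional[int]) -> Inst:
    """Record the predicted producer store (by timestamp), if any."""
    inst.has_mem_dep = dep_timestamp is not None
    inst.dep_timestamp = dep_timestamp
    return inst


load = tag_dependence(Inst(timestamp=42, pc=16, is_mem=True), dep_timestamp=40)
print(load.has_mem_dep, load.dep_timestamp)
```

Reusing the existing timestamp as the identifier means no separate inum allocator is needed: the value the LFST returns can be stored directly in `dep_timestamp`.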

A dependence is enforced by checking whether the store depended on is in the store queue and has issued, or whether the timestamp of the most recently retired instruction is greater than that of the store depended on. If either condition holds, the current memory instruction can safely issue.
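The issue check described above can be sketched as a predicate. It assumes the store queue is a dictionary keyed by timestamp and that the simulator exposes the timestamp of the most recently retired instruction; all names are hypothetical. The comparison uses a strict “greater than,” as the text states.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StoreQueueEntry:
    timestamp: int
    issued: bool


def dependence_satisfied(dep_timestamp: Optional[int],
                         store_queue: Dict[int, StoreQueueEntry],
                         last_retired_timestamp: int) -> bool:
    """Return True if a memory instruction may issue under its predicted dependence."""
    if dep_timestamp is None:
        return True  # no predicted dependence
    entry = store_queue.get(dep_timestamp)
    if entry is not None and entry.issued:
        return True  # case 1: the store is in the store queue and has issued
    # Case 2, as stated in the text: the last retired timestamp is greater
    # than the store's timestamp.
    return last_retired_timestamp > dep_timestamp


sq = {5: StoreQueueEntry(5, True), 7: StoreQueueEntry(7, False)}
print(dependence_satisfied(7, sq, 3))  # store 7 neither issued nor retired
```

This call prints `False`; the same load becomes issuable once store 7 issues or retirement moves past timestamp 7.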

4. Data Analysis

We evaluated the performance advantage of store sets by running several benchmarks first on the original simulator, which implements naïve speculation, and then on our modified simulator, which implements memory dependence prediction with store sets. The number of memory violations decreased in every benchmark tested. The largest gain was in lzw, with a 21% IPC increase. Parser, which had very few memory violations to begin with, showed almost no increase in IPC.

We also studied when to clear the SSIT to reduce aliasing. Specifically, we wanted to see whether aliasing correlates with how full the SSIT is. We measured performance when the SSIT was reset upon reaching 70%, 75%, 80%, 85%, 90%, and 95% occupancy. The results indicated that performance actually decreased for the bzip2 benchmark when the SSIT was reset. In Graph 3 we see that the number of memory violations increases
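The occupancy-triggered reset can be sketched as an SSIT wrapper that counts valid entries and clears the whole table once a threshold percentage is reached. The class and parameter names are hypothetical, and clearing every entry at once is an assumption; the text only states that the table was reset at 70% to 95% full.

```python
class ClearingSSIT:
    """SSIT that is cleared once a given percentage of entries are valid."""

    def __init__(self, size, reset_percent):
        self.entries = [None] * size
        self.valid = 0
        self.reset_percent = reset_percent  # e.g. 70, 75, ..., 95
        self.resets = 0

    def assign(self, index, ssid):
        if self.entries[index] is None:
            self.valid += 1
        self.entries[index] = ssid
        # Integer comparison avoids floating-point rounding at the threshold.
        if self.valid * 100 >= self.reset_percent * len(self.entries):
            self.entries = [None] * len(self.entries)
            self.valid = 0
            self.resets += 1

    def lookup(self, index):
        return self.entries[index]


t = ClearingSSIT(size=10, reset_percent=70)
for i in range(7):
    t.assign(i, 0)
print(t.resets, t.valid)
```

The seventh valid entry reaches 70% of a 10-entry table, so this prints `1 0`. A reset also discards useful store sets, which is one plausible reason resetting can hurt a benchmark such as bzip2.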

Memory violations under naïve speculation and with store sets:

                        bzip2    gcc     lzw      mcf     parser
Naïve        mem viol   4670     503     11691    2059    12
             percent    3.01%    2.01%   9.87%    1.16%   0.04%
Store Sets   mem viol   128      72      7        23      4
             percent    0.08%    0.04%   0.01%    0.01%   0.01%
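The relative reduction in memory violations can be computed directly from the table. This snippet only reuses the table’s violation counts; the percentages it prints are derived arithmetic, not additional measurements.

```python
naive = {"bzip2": 4670, "gcc": 503, "lzw": 11691, "mcf": 2059, "parser": 12}
store_sets = {"bzip2": 128, "gcc": 72, "lzw": 7, "mcf": 23, "parser": 4}


def reduction_pct(before, after):
    """Percentage of memory violations removed by store sets."""
    return 100.0 * (before - after) / before


for bench in naive:
    pct = reduction_pct(naive[bench], store_sets[bench])
    print(f"{bench:7s} {pct:6.2f}% fewer violations")
```

The reductions range from about 66.7% for parser to about 99.9% for lzw; for bzip2 the count drops by about 97.3%.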