









EE482B: Advanced Computer Organization. Lecture #4. Interconnection Networks Architecture and Design. Stanford University. TOPOLOGY.










(Packaging Torus, Slicing, Distributors, Concentrators and Non-Blocking Networks)
Lecture #4: Monday, April 16, 2001
Lecturer: Prof. William J. Dally
Scribe: Lesley Chang, Suresh Sankaralingam, Prashant Nedungadi
Reviewer: Kelly Shaw
Topics Covered in this lecture
1.1 Research Paper Assignment
The Research Paper assignment is posted on the web. It involves the following major activities:
The Research papers are due by May 16, 2001.
1.2 Project Assignment
The Project assignment is a much more involved task. It involves the following major activities:
There are two checkpoints and a final report. See the Project handout for more details.
2.1 How to package Torus Networks?
Torus networks are easy to map to physical space. However, each dimension has one long end-around channel between node k-1 and node 0, and that channel has a length of k. Such long channels may result in excessive latency or require slower signalling rates, and they cause non-uniform delay.
This problem can be solved by folding a torus network as shown in Figure 1. The first half of the nodes are packaged on top and the second half of the nodes are packaged on the bottom. By folding a k-ary n-cube torus network, in each plane, the distance between k-2 nodes is 2 and 2 nodes is
Figure 1. Examples of folded torus networks
Folding increases the minimum distance between neighboring nodes, but it eliminates the long end-around wire, and placement is easier. Whether to fold a network is a trade-off that depends on the topology.
[Figure 1 panels: (a) unfolded 4-ary 1-cube with nodes in the order 0, 1, 2, 3; (b) folded 4-ary 1-cube; a further panel showing a grid of nodes labeled 00 through 33.]
There are 3 major ways to do slicing.
3.1 Bit Slicing
Bit slicing is the most straightforward method of dividing a network node across multiple chips. The example shown below has an 8-bit 2-D node sliced into two 4-bit wide modules. Each of the 8 channels is divided into a pair of 4-bit channels. A number of control lines (ctl) are used to pass information between the two router bit slices.
Figure 4. Example of bit-slicing for a 2-D torus network with w=
Consider an 8-bit channel (0:7). It can be split into two 4-bit channels (0:3 and 4:7). In that case, what would be the structure of the payload and the header information in the packet? One option is to duplicate the header, so that each slice carries a copy of it, as shown in the following diagram.
Figure 5. Duplication of headers in packet for bit slicing
[Figure 4 panels: (a) a single network node with 8-bit channels ei, eo, wi, wo, ni, no, si, so [0:7]; (b) the same node as two slices, one carrying bits [0:3] and one carrying bits [4:7] of every channel, joined by ctl lines.]
[Figure 5 labels: header fields vc, vc, dest, dest, dest, dest, followed by the payload.]
The disadvantage of this method is that for small packet sizes, the overhead in the packet is multiplied by the number of slices which results in increased serialization latency. Some of the alternatives for reducing the effect of the multiple headers are shown below.
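Before turning to those alternatives, the overhead can be made concrete with a small calculation. The sketch below is illustrative only: the function name, the 8-bit header, and the 32-bit payload are assumptions rather than values from the lecture, and it assumes each slice carries a full copy of the header plus an equal share of the payload.

```python
from math import ceil

def serialization_cycles(header_bits, payload_bits, channel_width, slices=1):
    """Cycles needed to push one packet across a channel cut into bit slices.

    Assumption: every slice carries its own full copy of the header and an
    equal share of the payload, over a slice that is channel_width / slices
    bits wide.
    """
    slice_width = channel_width // slices
    bits_per_slice = header_bits + ceil(payload_bits / slices)
    return ceil(bits_per_slice / slice_width)

# Hypothetical packet: 8-bit header, 32-bit payload, 8-bit channel.
unsliced = serialization_cycles(8, 32, 8)           # (8 + 32) / 8
sliced = serialization_cycles(8, 32, 8, slices=2)   # (8 + 16) / 4
print(unsliced, sliced)  # 5 6
```

With these numbers the payload still takes 4 cycles either way, but the header costs 2 cycles instead of 1 once it is duplicated across two slices. The shorter the payload, the larger that extra cycle looms.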
One option for pipelining the header latency is to connect the header to the master and the payload to both master and slave. The header precedes the payload on the master side, leaving room at the end of the current payload to store the next message, as in Figure 6(a). The slave buffers the payload until it receives control from the master.
Another option is to send the header first as in Figure 6(c) and then follow it up with payload. Dally has seen bit slicing where they leave a gap between the header and the payload and then shape the payload in such a way that the next header can be processed in a pipelined manner as in Figure 6(d).
One more option is to split the header information as in Figure 6(b), process the pieces in the bit slices, and exchange the control information. When the 2 subnodes work in concert, both making the same decisions, there is no problem. But when there are errors, they are no longer in lockstep. Also, in this case, the control decisions have to be distributed from the master to the slave; the chip crossing increases the per-hop latency, and the serialization latency also increases.
Figure 6. Header Pipelining and Split header for bit slicing
[Figure 6 panels: (a) through (d), each showing how the header and payload are laid out across a master slice and a slave slice.]
Figure 9. Example of dimension sliced network
3.3 Channel Slicing
In the channel sliced network, one wide channel can be broken down into multiple narrow channels which are completely separate networks. In channel slicing, there is no duplication of header information and no communication between the partitions. Channel slicing as shown in Figure 10 (a) and (b) effectively doubles the serialization latency while reducing the cross connect logic between two packets.
Figure 10. Examples of Channel Sliced Network
Channel slicing leaves you with 2 half-wide network nodes. It provides fault tolerance, but there can be load balance issues if the distributor doesn't distribute the traffic between the two half-wide networks properly. Another problem with channel slicing is that since bandwidth is halved, the serialization latency is doubled.
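[Figure 9 labels: a network node with channels ei, eo, wi, wo, ni, no, si, so [0:7], divided into a north/south network node and an east/west network node.]
[Figure 10 panels: (a) and (b) show a terminal, a distributor, and two half-wide networks (Network 1 / Net 1 and Network 2 / Net 2), with 2P annotations.]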
3.4 Summary of Slicing: Any time you slice a node up, you pay for it one way or another.
4.1 Distributor:
A distributor takes one high bandwidth channel and distributes its packets over several lower bandwidth channels. Distributors provide scalability when the nodes of the network move towards handling higher bandwidths than what the link currently supports. In the example given below, the 10 Gbps node is distributed onto two 5 Gbps links.
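A minimal sketch of the idea follows. The lecture does not say which policy a distributor uses to pick a link, so the round-robin policy here is an assumption, as are the function name and the packet labels.

```python
from itertools import cycle

def distribute(packets, num_links):
    """Spread packets from one wide channel over num_links narrower links.

    Assumption: plain round-robin, which balances load only when packets are
    of similar size.
    """
    links = [[] for _ in range(num_links)]
    for packet, link in zip(packets, cycle(range(num_links))):
        links[link].append(packet)
    return links

links = distribute(["p0", "p1", "p2", "p3", "p4"], 2)
print(links)  # [['p0', 'p2', 'p4'], ['p1', 'p3']]
```

Round-robin keeps the packet counts on the two links within one of each other. A policy that ignores link occupancy could instead send most traffic down one half-bandwidth link.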
Figure 11. Examples of Distributors
Why use distributors?
Concentration relies on one assumption: the network is provisioned for a peak bandwidth equal to the peak bandwidth of a single terminal, on the expectation that the other terminals will not transmit at the same time. A simple probability argument shows that a conflict between two or more terminals is unlikely.
In Figure 13(a), a single terminal transmits at 10 Gbps with probability 0.1 and is idle with probability 0.9. With 2 terminals, the probability of no traffic drops to 0.9*0.9 = 0.81, as shown in Figure 13(b). The probability that exactly one node transmits becomes 2*0.1*0.9 = 0.18, and the probability that both nodes transmit (the conflict case) is 0.1*0.1 = 0.01. This extends to the case of three nodes, as shown in Figure 13(c). As the concentration factor increases, the probability of collision increases.
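These figures follow from a binomial distribution if each terminal is assumed to transmit independently with probability 0.1 (the value implied by the 0.9 idle probability). The sketch below reproduces them; the function name and the independence assumption are illustrative, not taken from the lecture.

```python
from math import comb

def transmit_distribution(terminals, p_active=0.1):
    """P(exactly k of `terminals` transmit) for k = 0..terminals.

    Assumption: terminals transmit independently, each with probability
    p_active.
    """
    return [
        comb(terminals, k) * p_active**k * (1 - p_active) ** (terminals - k)
        for k in range(terminals + 1)
    ]

for t in (1, 2, 3):
    dist = transmit_distribution(t)
    conflict = sum(dist[2:])  # two or more terminals transmitting at once
    print(t, [round(p, 3) for p in dist], round(conflict, 3))
# 1 [0.9, 0.1] 0
# 2 [0.81, 0.18, 0.01] 0.01
# 3 [0.729, 0.243, 0.027, 0.001] 0.028
```

Under this model the conflict probability rises from 0.01 with two terminals to 0.028 with three.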
Figure 13. Conflict probabilities in concentrators
The probability that one node transmits at peak bandwidth while the other nodes are also transmitting at peak bandwidth is very small. There is a cost-performance trade-off in switching to a concentrator/distributor based network. You can either fix the cost and look at performance, or fix the performance and look at the cost. Additionally, the concentrator and the network node are usually placed on the same package.
5. NON-BLOCKING NETWORKS
Blocking or non-blocking behavior of a network is essentially a “circuit switching concept”. If the network is blocking, then rearrangement is needed. The rearrangement is done in response to protection switching. If a network line is lost because somebody ripped it out of the ground, you have to make sure that the connection is re-established within a small amount of time. Since there is not enough time to compute a new arrangement of routes, non-blocking networks are used. Non-blocking networks are divided into 2 categories
Rearrangeable networks can also have hitless switching.
[Figure 13 panels: (a) 1 node, (b) 2 nodes, (c) 3 nodes, with bandwidth labels 10 Gbps, 20 Gbps, and 30 Gbps.]
Hitless switching: the switch happens on a frame boundary so that it cannot be observed; no bits are lost on the switchover.
From the ports' perspective, rearrangeable with hitless switching is indistinguishable from strictly non-blocking. However, rearrangeable routes need to be computed off-line, and in some situations there may not be enough time to compute the new routes.
5.1 Crossbar Networks
The crossbar is a strictly non-blocking network with stiff flow control. An n x m crossbar or crosspoint switch directly connects n inputs to m outputs with no intermediate stages. In effect, such a switch consists of m n:1 multiplexers, one for each output. Many crossbar networks are square in that m=n. Others are rectangular with m > n or m < n.
Figure 14. Relay based crossbar switch
In Figure 14, we have a relay based crossbar. Here, at each cross connect, you have a switch. When the relays are closed, then they form a path between the input and output ports connected to that switch. By closing more than one switch, the same input port can transmit to multiple output ports. This feature is used in multicasting.
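The multiplexer view of a crossbar, including multicast, can be sketched as follows. The class and method names are hypothetical, and the model ignores timing and flow control.

```python
class Crossbar:
    """n x m crossbar modeled as m independent n:1 multiplexers."""

    def __init__(self, n_inputs, n_outputs):
        self.n_inputs = n_inputs
        # select[j] is the input currently driving output j, or None if idle.
        self.select = [None] * n_outputs

    def connect(self, inp, out):
        """Close the crosspoint between input `inp` and output `out`."""
        if not 0 <= inp < self.n_inputs:
            raise ValueError("no such input")
        if self.select[out] is not None:
            raise ValueError("output already driven by another input")
        self.select[out] = inp

    def route(self, inputs):
        """Return the value seen at each output for the given input values."""
        return [None if s is None else inputs[s] for s in self.select]

xbar = Crossbar(4, 4)
xbar.connect(0, 1)
xbar.connect(0, 3)  # same input closed onto a second output: multicast
xbar.connect(2, 0)
print(xbar.route(["a", "b", "c", "d"]))  # ['c', 'a', None, 'a']
```

An output can be driven by only one input at a time, but nothing prevents one input from driving several outputs. That asymmetry is what makes multicast possible.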
[Figure 14 labels: input lines (in) crossing output lines (out0 through out3), with a crosspoint at each intersection.]
A kn x kn crossbar can be constructed from k^2 n x n crossbars as shown in Figure 16. The number of crosspoint switches needed is kn x kn, which grows as n^2. The fan-in and fan-out logic grows as n log(n). Effectively, the number of crosspoint switches dominates the cost. A crossbar that spans several chips is therefore pin limited. But if the entire crossbar can fit on one chip, then you are all set.
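The arithmetic is easy to tabulate. The helper below is a hypothetical illustration that counts chips and crosspoints only; it does not model the fan-in and fan-out logic.

```python
def tiled_crossbar(k, n):
    """Chips and crosspoints for a kn x kn crossbar built from n x n chips."""
    chips = k * k
    crosspoints = chips * n * n  # equals (k * n) ** 2
    return chips, crosspoints

for k in (1, 2, 4):
    print(k, tiled_crossbar(k, 8))
# 1 (1, 64)
# 2 (4, 256)
# 4 (16, 1024)
```

Doubling the port count quadruples both the chip count and the crosspoint count.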
Scheduling crossbars is a very interesting topic, which we will talk about later this quarter. Although the idea of connecting in to out is easy, this is not a trivial topic and we will find out why in a future lecture.
5.2 Clos Networks
In a Clos network, each middle-stage switch has one input link from every input switch and one output link to every output switch. Thus, the r input switches are n x m crossbars that connect n input ports to m middle switches, the m middle switches are r x r crossbars that connect r input switches to r output switches, and the r output switches are m x n crossbars that connect m middle switches to n output ports. This is shown in Figure 17.
Figure 17. A (m=3, n=3, r=4) symmetric Clos network
A symmetric Clos network is characterized by (m, n, r) where:
m: the number of middle-stage switches
n: the number of input (output) ports on each input (output) switch
r: the number of input and output switches
This network has m-fold path diversity because there is 1 path through each middle switch. It is strictly non-blocking if m >= 2n - 1. In the worst case, the other n-1 inputs on the calling input switch occupy n-1 middle switches, and the other n-1 outputs on the destination output switch occupy n-1 different middle switches, so 2(n-1) middle switches are busy. To route the new call we need at least one free middle switch, so we need at least one more than 2(n-1) for the network to be non-blocking.
[Figure 17 labels: n=3 ports per switch; r=4 n x m input switches (input switches 1 to 4); m=3 r x r middle switches (middle switches 1 to 3); r=4 m x n output switches (output switches 1 to 4).]
The model is:

Stage 1 : 1111111...111000000000...
          <---n-1--->
Stage 2 : 0000000...0000111111...
                       ^<---n-1--->
                       |
                       available

We therefore need 2(n-1) + 1 = 2n-1 middle stage switches. This is strictly non-blocking because an adversary cannot concentrate traffic: a call can go over any middle switch, and there will always be a free middle switch.
Figure 18. A (5,3,4) Clos network
From the condition derived above for a Clos network to be strictly non-blocking, the network shown in Figure 17 is not strictly non-blocking, since m = 3 is less than 2n-1 = 5. The network shown in Figure 18 is strictly non-blocking for unicast traffic since it satisfies the condition (m >= 2n-1). So, at any point in time, you can always find a free middle stage to switch the incoming traffic.
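The worst-case counting argument can be checked mechanically for both networks. This is a sketch under the same assumption as the derivation: the other n-1 calls at each end occupy disjoint middle switches whenever enough middle switches exist. The function names are hypothetical.

```python
def worst_case_free(m, n):
    """Middle switches left for a new call when the other n-1 inputs on its
    input switch and the other n-1 outputs on its output switch are busy."""
    busy = min(m, 2 * (n - 1))
    return m - busy

def strictly_nonblocking(m, n):
    """True when a free middle switch is guaranteed, i.e. m >= 2n - 1."""
    return worst_case_free(m, n) >= 1

print(strictly_nonblocking(3, 3))  # (3,3,4) network of Figure 17 -> False
print(strictly_nonblocking(5, 3))  # (5,3,4) network of Figure 18 -> True
```

Note that r does not appear: the condition depends only on the ports per switch and the number of middle switches.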
For a complete routing example, consider the permutation {5,7,11,6,12,1,8,10,3,2,9,4}. That is, input 1 (1.1) routes to output 5 (2.2), input 2 (1.2) to output 7 (3.1), and so on. Since we need only consider switches and not ports, we can simplify this to {(2,3,4), (2,4,1), (3,4,1), (1,3,2)}. Input switch 1 has connections to output switches 2, 3 and 4, as you can see from Table 1. The first 3 columns show the input switch from which the call is coming (In), the output switch to which the call is going (Out), and the middle switch allocated to the call (Middle). The remaining eight columns give bit vectors showing which middle switches are free from each input and output switch.
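The simplification from ports to switches, and one possible way to allocate middle switches, can be sketched as below. The allocation policy (take the lowest-numbered middle switch that is free at both ends) is an assumption made for illustration and may not match the allocations listed in Table 1. The function names are hypothetical.

```python
def switch_level(permutation, n):
    """Group a port permutation by input switch; entries are the 1-based
    output switch numbers each input port is routed to."""
    out_switch = [(port - 1) // n + 1 for port in permutation]
    return [tuple(out_switch[i:i + n]) for i in range(0, len(out_switch), n)]

def route_greedy(calls_per_input, m):
    """Give each call the lowest-numbered middle switch free at both ends.

    Returns (input switch, output switch, middle switch) triples.
    """
    r = len(calls_per_input)
    in_free = {i: set(range(1, m + 1)) for i in range(1, r + 1)}
    out_free = {o: set(range(1, m + 1)) for o in range(1, r + 1)}
    assignment = []
    for i, outs in enumerate(calls_per_input, start=1):
        for o in outs:
            common = in_free[i] & out_free[o]
            if not common:
                raise RuntimeError(f"call {i}->{o} is blocked")
            mid = min(common)
            in_free[i].remove(mid)
            out_free[o].remove(mid)
            assignment.append((i, o, mid))
    return assignment

perm = [5, 7, 11, 6, 12, 1, 8, 10, 3, 2, 9, 4]
calls = switch_level(perm, n=3)
print(calls)  # [(2, 3, 4), (2, 4, 1), (3, 4, 1), (1, 3, 2)]
for call in route_greedy(calls, m=5):
    print(call)
```

The two sets `in_free` and `out_free` play the role of the free-switch bit vectors. A call can use a middle switch only if that switch is free at both its input switch and its output switch.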
[Figure 18 labels: n=3 ports per switch; r=4 n x m input switches (input switches 1 to 4); m=5 r x r middle switches (middle switches 1 to 5); r=4 m x n output switches (output switches 1 to 4).]