Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Virtual Channel Flow Control. Each switch has multiple virtual channels per physical channel. Each virtual channel keeps track of the output channel assigned to the head flit and pointers to the packet's buffered flits. A head flit must allocate the same three resources (a virtual channel, buffer space, and channel bandwidth) in the next switch before being forwarded. By having multiple virtual channels per physical channel, two different packets can share the physical channel, so the channel is not wasted when one packet is idle.
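
As an illustration (not from the lecture), a minimal sketch of the per-virtual-channel state a router input port might keep; the class and field names are hypothetical:

    from collections import deque
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class VirtualChannelState:
        """Bookkeeping for one virtual channel at one input port (illustrative)."""
        output_port: Optional[int] = None   # output channel chosen when the head flit was routed
        output_vc: Optional[int] = None     # virtual channel allocated on the downstream router
        credits: int = 4                    # downstream buffer slots known to be free
        flits: deque = field(default_factory=deque)  # buffered flits of the packet using this VC

        def can_advance(self) -> bool:
            # A flit may leave only if the VC is bound to an output and a
            # downstream buffer is known to be free.
            return bool(self.flits) and self.output_vc is not None and self.credits > 0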

Example: Wormhole vs. Virtual Channels. A is going from Node-1 to Node-4; B is going from Node-0 to Node-5, and Node-5 is blocked (no free VCs/buffers). [Figure: a six-node example; with wormhole flow control, B's stalled flits hold the buffers on the shared channel and A is stuck behind them, while with virtual channels A and B occupy separate VCs and A can keep moving.] Traffic analogy: B is trying to make a left turn; A is trying to go straight; there is no left-only lane with wormhole, but there is one with VCs.

Buffer Management. Credit-based: the upstream node keeps track of the number of free buffers in the downstream node; the downstream node sends back signals to increment the count when a buffer is freed; enough buffers are needed to hide the round-trip latency. On/Off: the downstream node sends back a signal when its buffers are close to being full; this reduces upstream signaling and counters, but can waste buffer space.
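
A minimal sketch of the credit counter the upstream port could maintain (my own illustration; class and method names are hypothetical):

    class CreditCounter:
        """Upstream-side credit tracking for one downstream buffer pool (illustrative)."""

        def __init__(self, downstream_buffers: int):
            # One credit per buffer slot in the downstream node.
            self.credits = downstream_buffers

        def try_send_flit(self) -> bool:
            # Send only if a downstream buffer is known to be free.
            if self.credits == 0:
                return False
            self.credits -= 1      # the flit will occupy one downstream slot
            return True

        def on_credit_return(self) -> None:
            # Downstream signals that it forwarded a flit and freed a buffer.
            self.credits += 1

Because a returned credit takes a full round trip to arrive, the downstream node needs enough buffers to cover that round-trip latency or the link will idle even when traffic is waiting.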

Deadlock Avoidance with VCs. VCs provide another way to number the links such that a route always uses ascending link numbers. [Figure: the network drawn once per VC plane; links on the first plane carry low numbers (e.g., 0-19) and the corresponding links on the second plane carry high numbers (e.g., 100-219), so a packet that would otherwise have to take a descending link crosses over to the second plane and keeps ascending.] Alternatively, use West-first routing on the 1st plane and cross over to the 2nd plane in case you need to go West again (the 2nd plane uses North-last, for example).
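
To make the ascending-numbers rule concrete, a small check plus an example of escaping to the higher-numbered plane (illustrative; the constant offset mirrors the figure's numbering convention):

    def uses_ascending_links(link_numbers) -> bool:
        """Deadlock-avoidance condition from this slide: every route must
        traverse links in strictly increasing order of their numbers."""
        return all(a < b for a, b in zip(link_numbers, link_numbers[1:]))

    # A route that would have to descend on plane 0 instead crosses to plane 1,
    # where the same physical links carry higher numbers (here: +100).
    PLANE_OFFSET = 100
    route_plane0 = [16, 17, 18]                                   # ascending so far
    route = route_plane0 + [2 + PLANE_OFFSET, 3 + PLANE_OFFSET]   # would be 2, 3 on plane 0
    assert uses_ascending_links(route)                            # 16 < 17 < 18 < 102 < 103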

Router Functions. Crossbar, buffers, arbiters, VC state and allocation, buffer management, ALUs, control logic. Typical on-chip network power breakdown: roughly 30% links, 30% buffers, 30% crossbar.

Virtual Channel Router. Buffers and channels are allocated per flit. Each physical channel is associated with multiple virtual channels; the virtual channels are allocated per packet, and the flits of different VCs can be interleaved on the physical channel. For a head flit to proceed, the router must first allocate a virtual channel on the next router. For any flit to proceed (including the head), the router must allocate the following resources: buffer space in the next router (credits indicate the available space) and access to the physical channel.

Router Pipeline. Four typical stages:
RC (routing computation): the head flit indicates the VC it belongs to, the VC state is updated, the headers are examined, and the next output channel is computed (note: this is done for all head flits arriving on the various input channels).
VA (virtual-channel allocation): the head flits compete for the available virtual channels on their computed output channels.
SA (switch allocation): a flit competes for access to its output physical channel.
ST (switch traversal): the flit is transmitted on the output channel.
A head flit goes through all four stages; the other flits do nothing in the first two stages (this is an in-order pipeline and flits cannot jump ahead); a tail flit also de-allocates the VC.

Router Pipeline. Four typical stages: RC (routing computation: compute the output channel), VA (virtual-channel allocation: allocate a VC for the head flit), SA (switch allocation: compete for the output physical channel), ST (switch traversal: transfer data on the output physical channel). Example timing:

Without a stall:
    Cycle:        1    2    3    4    5    6    7
    Head flit:    RC   VA   SA   ST
    Body flit 1:                 SA   ST
    Body flit 2:                      SA   ST
    Tail flit:                             SA   ST

With a one-cycle stall (STALL) in switch allocation for the head flit:
    Cycle:        1    2    3    4    5    6    7    8
    Head flit:    RC   VA   SA   SA   ST
    Body flit 1:                      SA   ST
    Body flit 2:                           SA   ST
    Tail flit:                                  SA   ST
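
The tables above can be derived mechanically; here is a small sketch of that derivation (my own illustration, with hypothetical function and parameter names): the head flit goes through RC, VA, and SA (repeating SA while it loses arbitration), then ST, and every later flit arbitrates one cycle after the previous flit's successful SA.

    def packet_schedule(num_flits: int, head_sa_stalls: int = 0):
        """Return {flit_name: {cycle: stage}} for one packet crossing one router.

        Head flit: RC, VA, SA (repeated on a stall), ST.
        Body/tail flits: SA one cycle after the previous flit's successful SA, then ST.
        """
        head = {1: "RC", 2: "VA"}
        cycle = 3
        for _ in range(head_sa_stalls + 1):     # SA repeats while arbitration is lost
            head[cycle] = "SA"
            cycle += 1
        head[cycle] = "ST"
        schedule = {"head": head}
        prev_sa = cycle - 1                     # cycle of the head's successful SA
        for i in range(1, num_flits):
            name = "tail" if i == num_flits - 1 else f"body{i}"
            sa = prev_sa + 1
            schedule[name] = {sa: "SA", sa + 1: "ST"}
            prev_sa = sa
        return schedule

    print(packet_schedule(4))                    # matches the table without a stall
    print(packet_schedule(4, head_sa_stalls=1))  # matches the table with one SA stall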

Speculative Pipelines. Perform VA and SA in parallel; note that SA only requires knowledge of the output physical channel, not the VC. If VA fails, the successfully allocated channel goes unutilized.

    Cycle:        1    2       3    4    5    6
    Head flit:    RC   VA+SA   ST
    Body flit 1:                SA   ST
    Body flit 2:                     SA   ST
    Tail flit:                            SA   ST

More aggressively, perform VA, SA, and ST in parallel (this can cause collisions and re-tries); typically VA is the critical path, so SA and ST can possibly be performed sequentially after it.

    Cycle:        1    2          3        4        5
    Head flit:    RC   VA+SA+ST
    Body flit 1:                  SA+ST
    Body flit 2:                           SA+ST
    Tail flit:                                      SA+ST

Router pipeline latency is a greater bottleneck when there is little contention, and when there is little contention, speculation will likely work well. Single-stage pipeline?

Recent Intel Router. Used for a 6x6 mesh; 16 B, > 3 GHz; wormhole with VC flow control. Source: Partha Kundu, "On-Die Interconnects for Next-Generation CMPs," talk at the On-Chip Interconnection Networks Workshop, Dec 2006.

Recent Intel Router (two further slides show figures from the same talk). Source: Partha Kundu, "On-Die Interconnects for Next-Generation CMPs," talk at the On-Chip Interconnection Networks Workshop, Dec 2006.

Magnetic Disks. A magnetic disk consists of 1-12 platters (metal or glass disks covered with magnetic recording material on both sides), with diameters between 1 and 3.5 inches. Each platter is comprised of concentric tracks (5,000-30,000 per surface), and each track is divided into sectors (100-500 per track, each about 512 bytes). A movable arm holds the read/write heads for each disk surface and moves them all in tandem; a cylinder of data is accessible at a time.
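
A quick capacity estimate using illustrative values drawn from the ranges above (the specific numbers are my own):

    # Illustrative disk-capacity estimate (values picked from the ranges above).
    platters = 4
    surfaces = platters * 2            # both sides of each platter are recorded
    tracks_per_surface = 30_000
    sectors_per_track = 500
    bytes_per_sector = 512

    capacity = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
    print(capacity / 1e9, "GB")        # about 61 GB for these assumed values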

Disk Latency. To read or write data, the arm has to be placed on the correct track; this seek time usually takes 5 to 12 ms on average, and can take less if there is spatial locality. Rotational latency is the time taken to rotate the correct sector under the head; the average is typically 2 ms or more (a full rotation takes 4 ms at 15,000 RPM). Transfer time is the time taken to transfer a block of bits out of the disk, typically at 3-65 MB/second. A disk controller maintains a disk cache (so spatial locality can be exploited) and sets up the transfer on the bus (controller overhead).
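
A worked example of total access time for a 4 KB read; the specific numbers are illustrative choices within the ranges above:

    # Average time to service a 4 KB read (illustrative numbers).
    seek_ms       = 6.0                        # average seek time
    rotation_ms   = 0.5 * (60_000 / 15_000)    # half a rotation at 15,000 RPM = 2 ms
    transfer_ms   = 4 / 50_000 * 1000          # 4 KB at 50 MB/s = 0.08 ms
    controller_ms = 0.2                        # controller overhead

    total_ms = seek_ms + rotation_ms + transfer_ms + controller_ms
    print(round(total_ms, 2), "ms")            # ~8.28 ms, dominated by seek and rotation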

RAID. Reliability and availability are important metrics for disks. RAID: redundant array of inexpensive (independent) disks. Redundancy can deal with one or more failures. Each sector of a disk records check information that allows the disk to determine whether that sector has an error (in other words, redundancy already exists within a disk); when a read flags an error, we turn elsewhere for correct data.

RAID 0 and RAID 1. RAID 0 has no additional redundancy (the name is a misnomer); it uses an array of disks and stripes (interleaves) data across them to improve parallelism and throughput. RAID 1 mirrors or shadows every disk: every write happens to two disks. Reads may go to the mirror only when the primary disk fails, or you may read both copies and accept the quicker response. An expensive solution: high reliability at twice the cost.
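
For the striping mentioned above, a tiny sketch (hypothetical function name) of how a logical block number maps to a disk and an offset under RAID 0:

    def raid0_location(logical_block: int, num_disks: int):
        """RAID 0 striping: logical blocks are interleaved round-robin across the disks."""
        disk = logical_block % num_disks
        offset = logical_block // num_disks    # block index within that disk
        return disk, offset

    # With 4 disks, blocks 0-3 land on disks 0-3, block 4 wraps back to disk 0, etc.
    print([raid0_location(b, 4) for b in range(6)])
    # [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]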

RAID 3. Data is bit-interleaved across several disks, and a separate disk maintains parity information for each set of bits. For example, with 8 data disks, bit 0 is on disk 0, bit 1 is on disk 1, ..., bit 7 is on disk 7, and disk 8 maintains parity over all 8 bits. For any read, all 8 data disks must be accessed (as we usually read more than a byte at a time), and for any write, all 9 disks must be accessed, as parity has to be re-calculated. High throughput for a single request, low cost for redundancy (overhead: 12.5%), low task-level parallelism.
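
The parity disk simply stores the XOR of the corresponding bits on the data disks, which is enough to reconstruct any single failed disk. A small sketch (illustrative, hypothetical names):

    from functools import reduce

    def parity(blocks):
        """Bitwise XOR across equal-sized blocks -- the RAID parity function."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [bytes([i] * 4) for i in range(8)]    # contents of 8 data disks (4 bytes each)
    p = parity(data)                             # contents of the parity disk

    # Reconstruct a failed disk (say disk 3) by XOR-ing the surviving disks and parity.
    recovered = parity([d for i, d in enumerate(data) if i != 3] + [p])
    assert recovered == data[3]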

RAID 4 and RAID 5. Data is block-interleaved; this allows us to get all our data from a single disk on a read (in case of a disk error, read all 9 disks). Block interleaving reduces throughput for a single request (as only a single disk drive services the request), but improves task-level parallelism, since the other disk drives are free to service other requests. On a write, we access the disk that stores the data and the parity disk; parity information can be updated simply by checking how the new data differs from the old data.
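
That small-write update works because parity is XOR: the new parity is the old parity XORed with the bits that changed. A sketch with hypothetical names:

    def updated_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        """RAID 4/5 small-write parity update:
        new_parity = old_parity XOR old_data XOR new_data.
        Only the target data disk and the parity disk are read and written."""
        return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

    old_data, new_data = b"\x0f\x00", b"\x0a\x01"
    old_parity = b"\x33\x44"
    print(updated_parity(old_data, new_data, old_parity).hex())   # '3645'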

RAID 5. If we have a single disk for parity, multiple writes cannot happen in parallel (as all writes must update the parity information). RAID 5 distributes the parity blocks across the disks to allow simultaneous writes.
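
One common way to distribute parity (my illustration; the lecture does not specify a particular layout) is to rotate the parity block across the disks from stripe to stripe:

    def raid5_parity_disk(stripe: int, num_disks: int) -> int:
        """Rotate the parity block across disks so that writes to different
        stripes update different parity disks and can proceed in parallel."""
        return (num_disks - 1 - stripe) % num_disks

    # With 5 disks, stripes 0-4 place parity on disks 4, 3, 2, 1, 0, then repeat.
    print([raid5_parity_disk(s, 5) for s in range(6)])   # [4, 3, 2, 1, 0, 4]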

RAID Summary. RAID 1-5 can tolerate a single fault; mirroring (RAID 1) has a 100% overhead, while parity (RAID 3, 4, 5) has modest overhead. Multiple faults can be tolerated by having multiple check functions; each additional check can cost an additional disk (RAID 6). RAID 6 and RAID 2 (memory-style ECC) are not commercially employed.
