Multicomputer distributed system LECTURE 8 DR. SAMMAN H. AMEEN
Wide area network (WAN): A WAN connects a large number of computers spread over large geographic distances. It can span sites in multiple cities, countries, and continents.
Metropolitan area network (MAN): The MAN is an intermediate level between the LAN and WAN and typically spans a single city.
Local area network (LAN): A LAN connects a small number of computers in a small area within a building or campus.
System or storage area network (SAN): A SAN connects computers or storage devices to form a single system.
A network channel c = (x, y) is characterized by:
width w_c: the number of parallel signals it contains,
frequency f_c: the rate at which bits are transported on each signal,
latency t_c: the time required for a bit to travel from x to y.
The bandwidth of the channel is W = w_c * f_c. The throughput Θ of a network is the data rate, in bits per second, that the network accepts per input port. Under a particular traffic pattern, the channel that carries the largest fraction of the traffic determines the maximum channel load γ. The load on a channel can be equal to or smaller than the channel bandwidth, so Θ = W / γ.
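The relations above can be sketched in a few lines of code; the numbers used here are illustrative, not taken from any real network.

```python
# W = w_c * f_c and Theta = W / gamma, as defined above.

def channel_bandwidth(width_signals: int, freq_hz: float) -> float:
    """Channel bandwidth W = w_c * f_c, in bits per second."""
    return width_signals * freq_hz

def throughput(bandwidth_bps: float, gamma: float) -> float:
    """Theta = W / gamma, where gamma is the load on the busiest channel,
    expressed as a multiple of the traffic injected per input port."""
    return bandwidth_bps / gamma

# Example: 16 signals at 1 GHz gives W = 16 Gbit/s; if the busiest
# channel carries twice the per-port injected load, Theta = 8 Gbit/s.
W = channel_bandwidth(width_signals=16, freq_hz=1e9)
print(throughput(W, gamma=2.0))
```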
Deterministic: The simplest algorithm. For each source-destination pair there is a single path. This routing usually achieves poor performance because it fails to use alternative routes and concentrates traffic on only one set of channels.
Oblivious: So named because it ignores the state of the network when determining a path. Unlike deterministic routing, it considers a set of paths from source to destination and chooses between them.
Adaptive: The routing decision changes based on the state of the network.
Message: the logical unit of internode communication.
Packet: the basic unit containing a destination address for routing; packets carry sequence numbers for reassembly.
Flit: a flow control digit, the unit into which packets are divided.
Header flits contain routing information and the sequence number.
Flit length is affected by network size.
Packet length is determined by the routing scheme and network implementation.
Both lengths also depend on channel bandwidth, router design, network traffic, etc.
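The packet/flit split can be sketched as follows; the flit size, field names, and `packetize` function are invented for illustration and are not part of any specific router design.

```python
# Hypothetical sketch: split a packet's payload into fixed-size flits.
# The head flit carries the routing information (destination) and the
# packet's sequence number needed for reassembly at the receiver.

FLIT_BYTES = 4  # flit size is a design parameter of the router

def packetize(dest: int, seq: int, payload: bytes):
    head = ("HEAD", dest, seq)                      # routing info + sequence #
    body = [("BODY", payload[i:i + FLIT_BYTES])
            for i in range(0, len(payload), FLIT_BYTES)]
    tail = ("TAIL",)                                # marks end of packet
    return [head] + body + [tail]

# A 12-byte payload yields 1 head + 3 body + 1 tail = 5 flits.
flits = packetize(dest=5, seq=42, payload=b"hello world!")
print(len(flits))
```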
100 Gigabit Ethernet (100GbE) and 40 Gigabit Ethernet (40GbE) are groups of computer networking technologies for transmitting Ethernet frames at rates of 100 and 40 gigabits per second (100 and 40 Gbit/s), respectively. The technology was first defined by the IEEE 802.3ba-2010 standard. InfiniBand (abbreviated IB) is a computer network communications link used in high-performance computing, featuring very high throughput and very low latency. It is used for data interconnect both among and within computers. As of 2014, it was the most commonly used interconnect in supercomputers.
SDR (Single Data Rate): 2.5 Gbit/s per lane * 4 lanes = 10 Gbit/s
DDR (Double Data Rate): 5 Gbit/s * 4 = 20 Gbit/s
QDR (Quadruple Data Rate): 10 Gbit/s * 4 = 40 Gbit/s
FDR (Fourteen Data Rate): 14 Gbit/s * 4 = 56 Gbit/s
EDR (Enhanced Data Rate): 25 Gbit/s * 4 = 100 Gbit/s
HDR (High Data Rate), NDR (Next Data Rate): later generations.
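The 4x aggregates above are just per-lane rate times lane count, which a short snippet makes explicit (rates taken from the list above):

```python
# Per-lane signaling rates (Gbit/s) from the generations listed above;
# a 4x link aggregates four lanes.
lane_gbps = {"SDR": 2.5, "DDR": 5, "QDR": 10, "FDR": 14, "EDR": 25}

for gen, rate in lane_gbps.items():
    print(f"{gen} 4x: {rate * 4} Gbit/s")
```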
Latency is one element that contributes to network speed. The term refers to any kind of delay typically incurred in processing network data. A low-latency connection generally experiences small delays, while a high-latency connection suffers from long delays.
Theoretical effective throughput of InfiniBand generations:

Characteristic           SDR         DDR     QDR     FDR-10          FDR             EDR       HDR       NDR
1x rate (Gbit/s)         2           4       8       10              14              25        50        --
4x, 12x rates (Gbit/s)   8, 24       16, 48  32, 96  41.25, 123.75   54.54, 163.64   100, 300  200, 600  --
Latency (microseconds)   5           2.5     1.3     0.7             0.7             0.5       --        --
Year introduced          2001, 2003  2005    2007    2011            2011            2014      ~2017     after 2020
In the context of parallel computing, granularity is the ratio of communication time to computation time. Fine-grain parallelism is characterized by relatively more communication, since the computation time between communications is shorter. Coarse-grain parallelism, in turn, is characterized by fewer communications with much longer computation times. Load balance is easier to achieve with fine-grain parallelism because small tasks depend less on the operating system, interrupts, and so on. Coarse-grain parallelism, conversely, makes it harder to predict when any given task will terminate, and therefore harder to assign tasks for optimal usage of the multiple processors. Fine-grain parallelism requires more synchronization overhead due to the need to communicate data and synchronize tasks among processors; the fewer communications in coarse-grain parallelism therefore reduce overhead.
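The definition above reduces to a single ratio; this sketch computes it and applies an arbitrary classification threshold chosen purely for illustration (the 0.5 cutoff is not from the text).

```python
# Granularity as defined above: communication time / computation time.

def granularity(t_comm: float, t_comp: float) -> float:
    return t_comm / t_comp

def classify(ratio: float, threshold: float = 0.5) -> str:
    # A high ratio means relatively more communication per unit of
    # computation, i.e. fine-grain parallelism.
    return "fine-grain" if ratio > threshold else "coarse-grain"

print(classify(granularity(t_comm=8.0, t_comp=10.0)))   # fine-grain
print(classify(granularity(t_comm=1.0, t_comp=100.0)))  # coarse-grain
```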
Greedy: Always send the packet in the shortest direction around the ring. For example, always route from 0 to 3 in the clockwise direction and from 0 to 5 in the counterclockwise direction. If the distance is the same in both directions, pick a direction randomly.
Uniform random: Randomly pick a direction for each packet, with equal probability of picking either direction.
Weighted random: Randomly pick a direction for each packet, but weight the short direction with probability 1 - Δ/N and the long direction with Δ/N, where Δ is the (minimum) distance between source and destination and N is the number of links around the ring (equal to the number of nodes).
Adaptive: Send the packet in the direction for which the local channel has the lowest load. Load may be approximated by measuring the length of the queue serving the channel, or by recording how many packets it has transmitted over the last T slots.
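The four policies can be sketched for an N-node ring as below; the channel loads passed to the adaptive policy are assumed to be measured elsewhere (e.g. from queue lengths, as noted above).

```python
import random

# Direction-selection policies on an N-node ring. "CW" is clockwise,
# "CCW" counterclockwise.

def distances(src: int, dst: int, n: int):
    cw = (dst - src) % n
    return cw, n - cw  # clockwise and counterclockwise hop counts

def greedy(src, dst, n):
    cw, ccw = distances(src, dst, n)
    if cw < ccw:
        return "CW"
    if ccw < cw:
        return "CCW"
    return random.choice(["CW", "CCW"])  # equal distance: pick randomly

def uniform_random(src, dst, n):
    return random.choice(["CW", "CCW"])

def weighted_random(src, dst, n):
    cw, ccw = distances(src, dst, n)
    delta = min(cw, ccw)
    short, long_ = ("CW", "CCW") if cw <= ccw else ("CCW", "CW")
    # Short direction with probability 1 - delta/n, long with delta/n.
    return short if random.random() < 1 - delta / n else long_

def adaptive(src, dst, n, load_cw, load_ccw):
    # Loads are approximations, e.g. queue length per local channel.
    return "CW" if load_cw <= load_ccw else "CCW"

print(greedy(0, 3, 8))  # 3 hops clockwise vs 5 counterclockwise -> CW
print(greedy(0, 5, 8))  # 5 vs 3 -> CCW
```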
Circuit switching:
A circuit path is established a priori and torn down after use.
Routing, arbitration, and switching are performed once for a train of packets.
Reduces latency and overhead.
Can be highly wasteful of scarce network bandwidth: links and switches go underutilized during path establishment and tear-down if no train of packets follows the circuit set-up.
[Figure: circuit switching between a source and a destination, in steps]
1. Request for circuit establishment travels from source to destination; routing and arbitration are performed during this step, and switches hold the request tokens in buffers.
2. Acknowledgment and circuit establishment: as the ack token travels back to the source, the connections along the path are established.
3. Packet transport over the established circuit: neither routing nor arbitration is required.
4. Under high contention, the reserved circuit sees low utilization and hence low throughput.
Packet switching:
Routing, arbitration, and switching are performed on a per-packet basis.
Sharing of network link bandwidth is done on a per-packet basis, giving more efficient sharing and use of network bandwidth by multiple flows when individual sources transmit packets intermittently.
Store-and-forward switching: bits of a packet are forwarded only after the entire packet is first stored; packet transmission delay is multiplicative with hop count d.
Cut-through switching: bits of a packet are forwarded once the header portion is received; packet transmission delay is additive with hop count d.
  Virtual cut-through: flow control is applied at the packet level.
  Wormhole: flow control is applied at the flow-unit (flit) level.
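The multiplicative-vs-additive delay contrast can be made concrete with a simplified model that ignores router and wire delays; the packet and header sizes below are illustrative only.

```python
# Serialization latency under the two switching modes, ignoring router
# and propagation delays.

def store_and_forward(hops: int, packet_bits: float, bw_bps: float) -> float:
    # The whole packet is serialized at every hop: delay is
    # multiplicative in hop count d.
    return hops * packet_bits / bw_bps

def cut_through(hops: int, packet_bits: float, header_bits: float,
                bw_bps: float) -> float:
    # Only the header pays the per-hop cost; the packet body is
    # pipelined behind it: delay is additive in hop count d.
    return hops * header_bits / bw_bps + packet_bits / bw_bps

# Illustrative: 4 hops, 8000-bit packet, 80-bit header, 1 Gbit/s links.
d, L, H, b = 4, 8000.0, 80.0, 1e9
print(store_and_forward(d, L, b))   # 4 * 8 us = 32 us
print(cut_through(d, L, H, b))      # 4 * 0.08 us + 8 us = 8.32 us
```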
[Figure: store-and-forward switching] Packets are completely stored at each switch before any portion is forwarded; buffers must be sized to hold an entire packet (MTU).
[Figure: cut-through switching] Portions of a packet may be forwarded ("cut through") to the next switch before the entire packet is stored at the current switch.
[Figure: virtual cut-through vs. wormhole under a busy link] Virtual cut-through: buffers hold data packets and must be sized to hold an entire packet (MTU); when a link is busy, the packet is completely stored at the blocked switch. Wormhole: buffers hold flits, so packets can be larger than the buffers; when a link is busy, the packet is stored along the path, spanning several switches. Maximizing sharing of link bandwidth increases utilization.