Supercomputer Architecture


1 Basics of Supercomputing: Supercomputer Architecture
Prof. Thomas Sterling, Pervasive Technology Institute, School of Informatics & Computing, Indiana University
Dr. Steven R. Brandt, Center for Computation & Technology, Louisiana State University

2 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

3 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

4 New Fastest Computer in the World (Department of Computer Science, Louisiana State University)

5 Supercomputing System Stack
- Device technologies: enabling technologies for logic, memory, & communication
- Circuit design
- Computer architecture: semantics and structures
- Models of computation: governing principles
- Operating systems: manage resources and provide a virtual machine
- Compilers and runtime software: map the application program to system resources, mechanisms, and semantics
- Programming languages, tools, & environments
- Algorithms: numerical techniques; means of exposing parallelism
- Applications: end-user problems, often in science and technology

6 Classes of Architecture for High Performance Computers
- Parallel Vector Processors (PVP): NEC Earth Simulator, SX-6; Cray-1, 2, XMP, YMP, C90, T90, X1; Fujitsu VPP5000 series
- Massively Parallel Processors (MPP): Intel Touchstone Delta & Paragon; TMC CM-5; IBM SP-2 & 3, Blue Gene/L; Cray T3D, T3E, Red Storm/Strider
- Distributed Shared Memory (DSM): SGI Origin; HP Superdome
- Single Instruction stream, Multiple Data stream (SIMD): Goodyear MPP, MasPar 1 & 2, TMC CM-2
- Commodity Clusters: Beowulf-class PC/Linux clusters
- Constellations: HP Compaq SC, Linux NetworX MCR

7 Where Does Performance Come From?
- Device technology: logic switching speed and device density; memory capacity and access time; communications bandwidth and latency
- Computer architecture: instruction issue rate; execution pipelining; reservation stations; branch prediction; cache management
- Parallelism: number of operations per cycle per processor (instruction-level parallelism (ILP), vector processing); number of processors per node; number of nodes in a system

8 Top 500: System Architecture (chart)

9 Driving Issues/Trends
- Multicore: now 8 cores (AMD Opteron, Intel Xeon), possibly 100s soon; will be million-way parallelism
- Heterogeneity: GPGPU; ClearSpeed; Cell SPE
- Component I/O pins: off-chip bandwidth is not increasing with demand; limited number of pins; limited bandwidth per pin (pair); cache size per core may decline; shared-cache fragmentation
- System interconnect: node bandwidth is not increasing proportionally to core demand
- Power: megawatts at the high end = millions of dollars per year (see the sketch below)
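A back-of-the-envelope check on that last line; the $0.10/kWh electricity rate is an assumption for illustration, not a figure from the slides:

```c
#include <stdio.h>

/* Rough annual electricity cost of a machine drawing a constant load.
   The $0.10/kWh utility rate is an assumed illustrative value. */
int main(void) {
    double megawatts = 1.0;               /* sustained draw at the high end */
    double dollars_per_kwh = 0.10;        /* assumed rate */
    double hours_per_year = 24.0 * 365.0; /* 8760 hours */
    double cost = megawatts * 1000.0 * hours_per_year * dollars_per_kwh;
    printf("1 MW for a year: $%.0f\n", cost); /* ~$876,000 per MW-year */
    return 0;
}
```

At that rate every sustained megawatt costs on the order of a million dollars per year, which is the scale the slide refers to.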

10 Multi-Core
Motivation for multi-core:
- Exploits improved feature size and density
- Increases functional units per chip (spatial efficiency)
- Limits energy consumption per operation
- Constrains growth in processor complexity
Challenges resulting from multi-core:
- Relies on effective exploitation of multiple-thread parallelism: need for a parallel computing model and a parallel programming model
- Aggravates the memory wall: memory bandwidth (a way to get data out of the memory banks and into the multi-core processor array) and memory latency
- Fragments the (shared) L3 cache
- Pins become a strangle point: the rate of pin growth is projected to slow and flatten; bandwidth per pin (pair) is projected to grow slowly
- Requires mechanisms for efficient inter-processor coordination: synchronization; mutual exclusion; context switching (see the sketch below)
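A minimal sketch of the last point, mutual exclusion between threads, using POSIX threads; the thread count and iteration count are arbitrary illustrative values:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments a shared counter; the mutex enforces
   mutual exclusion so the updates do not race. */
static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter); /* NTHREADS * NITER */
    return 0;
}
```

Every increment pays for a lock acquisition here; keeping such coordination off hot paths is exactly the multi-core coordination cost the slide warns about.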

11 Heterogeneous Multicore Architecture
- Combines different types of processors, each optimized for a different operational modality
- The synthesis favors superior performance for complex computations exhibiting distinct modalities, better than any one of the N constituent processor types alone
- Conventional co-processors: graphics processing units (GPUs); network interface controllers (NICs); efforts are underway to apply existing special-purpose components to general applications
- Purpose-designed accelerators, integrated to significantly speed up some critical aspect of one or more important classes of computation: IBM Cell architecture; ClearSpeed SIMD attached array processor

12 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

13 Major Technology Generations (dates approximate)
- Electromechanical: 19th century through the first half of the 20th century
- Digital electronics with vacuum tubes: 1940s
- Transistors: 1947
- Core memory: 1950
- SSI & MSI RTL/DTL/TTL semiconductor logic: 1970
- DRAM: 1970s
- CMOS VLSI: 1990
- Multicore

14 Moore's Law
Moore's Law describes a long-term trend in the history of computing hardware: the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years.
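A quick numerical sketch of what two-year doubling implies; the Intel 4004 baseline (~2300 transistors in 1971) is used only as an illustrative starting point, not something taken from the slide:

```c
#include <stdio.h>
#include <math.h>

/* Transistor count under a strict two-year doubling, starting from the
   Intel 4004 (~2300 transistors, 1971) as an illustrative baseline. */
int main(void) {
    double base = 2300.0;
    for (int year = 1971; year <= 2011; year += 10) {
        double count = base * pow(2.0, (year - 1971) / 2.0);
        printf("%d: ~%.0f transistors\n", year, count);
    }
    return 0;
}
```

Forty years of doubling every two years is a factor of 2^20, about a million-fold, consistent with the march from thousands to billions of transistors per chip.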

15 The SIA ITRS Roadmap
(Chart: DRAM capacity in MB per chip, logic transistors per chip, and microprocessor clock rate in MHz, on a log scale, plotted against year of technology introduction.)

16 Impact of VLSI
- Mass-produced microprocessors enabled low-cost computing: PCs and workstations; economy of scale
- Ensembles of multiple processors: the microprocessor becomes the building block of parallel computers
- Favors sequential, process-oriented computing: a natural hardware-supported execution model; requires locality management (data, control)
- I/O channels (south bridge) provide the external interface: coarse-grained communication packets
- Suggests concurrent execution at the process-boundary level: processes statically assigned to processors (one-to-one); operating on local data; coordination by large value-oriented I/O messages for inter-process/processor synchronization and remote data exchange

17 Classical DRAM
(Charts: Gbits per chip over time, historical vs. SIA production vs. SIA introduction; die layout showing memory mats of ~1 Mbit each, row decoders, primary sense amps, secondary sense amps & page multiplexing, timing/BIST/interface, kerf, and percent chip overhead.)
Key points: density per chip has dropped below 4X per 3 years, and 45% of the die is non-memory.

18 Microprocessors no longer realize the full potential of VLSI technology
(Chart: performance in ps/instruction versus year on a log scale from 1e-3 to 1e+7, compared against the linear trend; the gap grows from 30:1 to 1,000:1 to 30,000:1. Courtesy of Bill Dally, Nvidia.)

19 The Memory Wall
(Chart: memory access time versus CPU cycle time over the years; the ever-growing ratio between them is "THE WALL".)

20 Performance Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM memory gap (latency):
- CPU ("Moore's Law"): ~60%/yr (2X/1.5 yr)
- DRAM: ~9%/yr (2X/10 yrs)
- Processor-memory performance gap grows ~50%/year
(Chart: relative performance versus time on a log scale, 1 to 1000. Copyright 2001, UCB, David Patterson.)
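The 50%/year figure is just the ratio of the two growth rates; a small sketch:

```c
#include <stdio.h>

/* If CPU performance grows 60%/yr and DRAM performance 9%/yr,
   the processor-memory gap grows by their ratio each year. */
int main(void) {
    double cpu = 1.0, dram = 1.0;
    for (int year = 1; year <= 10; year++) {
        cpu  *= 1.60;
        dram *= 1.09;
        printf("year %2d: gap = %6.1fx\n", year, cpu / dram);
    }
    /* 1.60 / 1.09 = 1.468..., i.e. the gap widens ~47% (about 50%) per year */
    return 0;
}
```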

21 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

22 Shared Memory Multiple Thread
- Static or dynamic; fine grained
- OpenMP (see the sketch below)
- Distributed shared memory systems (covered on day 3)
(Diagrams: a Symmetric Multi-Processor (SMP, usually cache coherent), e.g. Orion at JPL NASA, with CPUs reaching shared memory banks through a common network; and Distributed Shared Memory (DSM, often not cache coherent), with each CPU paired with its own memory and joined by a network.)
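A minimal OpenMP sketch of this shared-memory, multiple-thread style (compile with -fopenmp; the array size is an arbitrary illustrative choice):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* All threads share the arrays; OpenMP splits the loop iterations
   among them, each thread touching a disjoint slice. */
int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (double)i;

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f (ran with up to %d threads)\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}
```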

23 SMP Context
- A standalone system incorporating everything needed for computation: processors, memory, external I/O channels, local disk storage, user interface
- Enterprise server and institutional computing market
- Exploits economy of scale for enhanced performance-to-cost
- Substantial performance; target for ISVs (Independent Software Vendors)
- Shared-memory multiple-thread programming platform: easier to program than distributed-memory machines; enough parallelism to fully employ system threads (processor cores)
- Building block for ensemble supercomputers: commodity clusters, MPPs

24 Major Elements of an SMP Node
- Processor chip
- DRAM main memory cards
- Motherboard chipset: on-board memory network (north bridge); on-board I/O network (south bridge)
- Industry-standard interfaces: PCI, PCI-X, PCI Express
- System area network controllers, e.g. Ethernet, Myrinet, InfiniBand, Quadrics, Federation switch
- System management network: usually Ethernet; JTAG for low-level maintenance
- Internal disk and disk controller
- Peripheral interfaces

25 SMP Node Diagram
(Diagram: microprocessors (MP), each with L1 and L2 caches and sharing an L3, connect to memory banks M1..Mn-1 and, through controllers and NICs, to storage (S), PCI-e, JTAG, Ethernet, USB, and other peripherals.)
Legend: MP = microprocessor; L1, L2, L3 = caches; M1.. = memory banks; S = storage; NIC = network interface card.

26 Performance Issues for SMP Nodes
- Cache behavior: hit/miss rate; replacement strategies; prefetching
- Clock rate
- ILP
- Branch prediction
- Memory: access time; bandwidth; bank conflicts

27 Sample SMP Systems: Dell PowerEdge; HP ProLiant; Intel Server System; Microway Quadputer; IBM p

28 HyperTransport-based SMP System (diagram)

29 IBM POWER7 Processor and Core (images of the P7 processor and P7 core)

30 Processor Core Micro-Architecture
- Execution pipeline: stages of functionality to process issued instructions; hazards are conflicts that stall continued execution; forwarding supports closely associated operations exhibiting precedence constraints
- Out-of-order execution: uses reservation stations; hides some core latencies and provides fine-grained asynchronous operation supporting concurrency
- Branch prediction: permits computation to proceed past a conditional branch point before the predicate value is resolved; overlaps follow-on computation with predicate resolution; requires roll-back or equivalent to correct wrong guesses; sometimes follows both paths, several branches deep (see the sketch below)
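To see why good prediction matters, a hedged measurement sketch (array size, threshold, and the use of clock() are illustrative choices): the same loop is timed once with random branch outcomes and once with predictable ones, after sorting.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* Sum elements above a threshold; the if() is the branch of interest. */
static long long sum_above(const int *v, int n, int threshold) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > threshold)   /* predictable if v is sorted, random otherwise */
            s += v[i];
    return s;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++) v[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = sum_above(v, N, 128);   /* random branch outcomes */
    clock_t t1 = clock();
    qsort(v, N, sizeof *v, cmp);           /* make outcomes predictable */
    clock_t t2 = clock();
    long long s2 = sum_above(v, N, 128);
    clock_t t3 = clock();

    printf("unsorted: %lld (%.2fs)  sorted: %lld (%.2fs)\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}
```

On typical hardware the sorted pass runs severalfold faster, because each misprediction in the unsorted pass forces the roll-back the slide describes.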

31 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

32 What is a cache?
- Small, fast storage used to improve the average access time to slow memory; exploits spatial and temporal locality
- In computer architecture, almost everything is a cache! Registers are a cache on variables; the first-level cache is a cache on the second-level cache; the second-level cache is a cache on memory; memory is a cache on disk (virtual memory); the TLB is a cache on the page table; branch prediction is a cache on prediction information
(Diagram: processor/registers at the top, then L1 cache, L2 cache, memory, and disk/tape; levels grow bigger toward the bottom and faster toward the top. Copyright 2001, UCB, David Patterson.)

33 Multicore Microprocessor Component Elements
- Multiple processor cores (one or more per chip)
- L1 caches: instruction cache; data cache
- L2 cache: joint instruction/data cache; dedicated to an individual core
- L3 cache (not all systems): shared among multiple cores; often off die but in the same package
- Memory interface: address translation and management; (sometimes) north bridge
- I/O interface: south bridge

34 Levels of the Memory Hierarchy (Copyright 2001, UCB, David Patterson)

Level         | Capacity          | Access time             | Cost        | Staging transfer unit       | Managed by
CPU registers | 100s of bytes     | < 0.5 ns (~1 CPU cycle) |             | instr. operands, 1-8 bytes  | program/compiler
L1 cache      | 10s-100s of KB    | 1-5 ns                  | $10/MB      | blocks, 8-128 bytes         | cache controller
Main memory   | a few GB          | 50-150 ns               | $0.02/MB    | pages, 512-4K bytes         | OS
Disk          | 100s-1000s of GB  | ~10 ms                  | $0.25/GB    | files, MBytes               | user/operator
Tape          | "infinite"        | sec-min                 | $0.0014/MB  |                             | user/operator

Upper levels are smaller and faster; lower levels are larger and cheaper per byte.

35 Performance: Locality
- Temporal locality: if a program accesses a memory location, there is a much higher than random probability that the same location will be accessed again.
- Spatial locality: if a program accesses a memory location, there is a much higher than random probability that nearby locations will be accessed soon. Spatial locality is usually easier to achieve than temporal locality (see the sketch below).
- A couple of key factors affect the relationship between locality and scheduling: the size of the dataset being processed by each processor, and how much reuse is present in the code processing a chunk of iterations.
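A standard way to observe spatial locality, sketched below with an arbitrary matrix size: summing a matrix in row-major order walks memory with unit stride, while the column-major version strides by a full row and wastes most of each cache line it touches.

```c
#include <stdio.h>
#include <time.h>

#define N 4096

static double a[N][N];   /* 128 MB, zero-initialized */

/* C stores a[i][j] row by row, so the i-then-j loop walks memory
   sequentially (good spatial locality); swapping the loops strides
   by N doubles per access and misses in cache far more often. */
int main(void) {
    double sum = 0.0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* unit stride */
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];            /* stride of N doubles */
    clock_t t2 = clock();
    printf("row-major %.2fs, column-major %.2fs (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}
```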

36 Memory Hierarchy: Terminology
- Hit: data appears in some block in the upper level (example: block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level = RAM access time + time to determine hit/miss
- Miss: data must be retrieved from a block in the lower level (block Y)
  - Miss rate = 1 - hit rate
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << miss penalty (500 instructions on the 21264!)
(Diagram: blocks X and Y moving between the processor, upper-level memory, and lower-level memory. Copyright 2001, UCB, David Patterson.)
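These definitions combine into the usual average memory access time (AMAT) relation; the numbers below are illustrative assumptions, not figures from the slide:

$$AMAT = T_{hit} + r_{miss} \times T_{miss\ penalty}$$
$$\text{e.g.}\quad 1\,\text{ns} + 0.05 \times 100\,\text{ns} = 6\,\text{ns}$$

Even a 5% miss rate makes the average access six times slower than a hit, which is why miss rates dominate memory-hierarchy tuning.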

37 Cache Performance

$$T = I_{count} \cdot CPI \cdot T_{cycle}$$
$$I_{count} = I_{ALU} + I_{MEM}$$
$$CPI = \frac{I_{ALU}}{I_{count}}\,CPI_{ALU} + \frac{I_{MEM}}{I_{count}}\,CPI_{MEM}$$
$$T = \Big[ M_{ALU}\,CPI_{ALU} + M_{MEM}\big(CPI_{MEM\text{-}HIT} + r_{MISS}\,CPI_{MEM\text{-}MISS}\big) \Big] \cdot I_{count} \cdot T_{cycle}$$

where
T = total execution time
T_cycle = time for a single processor cycle
I_count = total number of instructions
I_ALU = number of ALU instructions (e.g., register-register)
I_MEM = number of memory-access instructions (e.g., load, store)
CPI = average cycles per instruction
CPI_ALU = average cycles per ALU instruction
CPI_MEM = average cycles per memory instruction
r_miss = cache miss rate; r_hit = cache hit rate
CPI_MEM-MISS = cycles per cache miss
CPI_MEM-HIT = cycles per cache hit
M_ALU = I_ALU / I_count (instruction mix for ALU instructions)
M_MEM = I_MEM / I_count (instruction mix for memory-access instructions)
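The model drops straight into code; the inputs below (instruction count, mix, cycle time, hit/miss cycles) are assumed values for illustration:

```c
#include <stdio.h>

/* Execution-time model from the slide:
   T = [M_ALU*CPI_ALU + M_MEM*(CPI_hit + r_miss*CPI_miss)] * I_count * T_cycle */
static double exec_time(double i_count, double m_mem,
                        double cpi_alu, double cpi_hit,
                        double cpi_miss, double r_miss,
                        double t_cycle) {
    double m_alu = 1.0 - m_mem;   /* instruction mixes sum to 1 */
    double cpi = m_alu * cpi_alu
               + m_mem * (cpi_hit + r_miss * cpi_miss);
    return cpi * i_count * t_cycle;
}

int main(void) {
    /* Illustrative inputs: 10^11 instructions, 30% memory ops, 1 ns cycle,
       1-cycle ALU ops and hits, 100-cycle miss penalty. */
    double i_count = 1e11, t_cycle = 1e-9;
    printf("r_miss=0.01: %.1f s\n",
           exec_time(i_count, 0.3, 1.0, 1.0, 100.0, 0.01, t_cycle));
    printf("r_miss=0.10: %.1f s\n",
           exec_time(i_count, 0.3, 1.0, 1.0, 100.0, 0.10, t_cycle));
    return 0;
}
```

With these assumptions, raising the miss rate from 1% to 10% roughly triples the execution time, the same qualitative story as the example on the next slide.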

38 Cache Performance: Example
Applying the model with CPI_MEM = CPI_MEM-HIT + r_MISS * CPI_MEM-MISS to two machines A and B that differ only in cache hit rate (r_hitA vs. r_hitB) gives total execution times T_A = 150 sec and T_B = 550 sec: a higher miss rate alone slows the same workload several-fold.

39 Performance Shared Memory (OpenMP): Key Factors
- Load balancing: mapping workloads with thread scheduling
- Caches: write-through vs. write-back
- Locality: temporal locality; spatial locality; how locality affects the choice of scheduling algorithm
- Synchronization: the effect of critical sections on performance (see the sketch below)
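A short OpenMP sketch touching two of these factors, scheduling and synchronization; the chunk size and loop bounds are arbitrary illustrative choices:

```c
#include <omp.h>
#include <stdio.h>

#define N 100000

/* Per-iteration cost varies with i, so static scheduling can leave
   threads unevenly loaded; dynamic scheduling rebalances per chunk. */
int main(void) {
    double total = 0.0;

    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < N; i++) {
        double t = 0.0;
        for (int j = 0; j < i % 1000; j++)  /* irregular work */
            t += 1.0;
        total += t;      /* reduction: no critical section on the hot path */
    }

    /* A critical section serializes threads; fine for rare updates,
       a bottleneck if placed inside a hot loop. */
    int max_threads = 0;
    #pragma omp parallel
    {
        #pragma omp critical
        {
            if (omp_get_thread_num() + 1 > max_threads)
                max_threads = omp_get_thread_num() + 1;
        }
    }
    printf("total=%.0f, saw %d threads\n", total, max_threads);
    return 0;
}
```

Using a reduction instead of a critical section in the inner loop is the usual way to keep synchronization cost off the hot path.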

40 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

41 What is a Commodity Cluster?
- A distributed/parallel computing system constructed entirely from commodity subsystems: all subcomponents can be acquired commercially and separately
- Computing elements (nodes) are employed as fully operational, standalone, mainstream systems
- Two major subsystems: compute nodes and the system area network (SAN)
- Employs industry-standard interfaces for integration
- Uses industry-standard software for the majority of services
- Incorporates additional middleware for interoperability among elements
- Uses software for coordinated programming of the elements in parallel

42 Cluster System
(Diagram: login & cluster access and the resource management & scheduling subsystem front a set of compute nodes; each node contains microprocessors with L1/L2/L3 caches, memory banks M1..Mn-1, controllers, storage (S), and NICs, all joined by the interconnect network.)

43 Clusters Dominate the Top 500 (chart)


45 UC Berkeley NOW Project
- NOW: SPARCstation 10s and 20s; originally ATM; first large Myrinet network
- NOW: UltraSPARC 170s; 128 MB memory and two 2 GB disks per node; Ethernet and Myrinet; largest Myrinet configuration in the world
- First cluster on the TOP500 list

46 Machine Parameters Affecting Performance
- Peak floating-point performance
- Main memory capacity
- Bi-section bandwidth
- I/O bandwidth
- Secondary storage capacity
- Organization: class of system; # nodes; # processors per node; accelerators; network topology
- Control strategy: MIMD; vector (PVP); SIMD; SPMD

47 Why Are Clusters so Prevalent?
- Excellent performance-to-cost for many workloads
- Exploit economy of scale: mass-produced device types; mainstream standalone subsystems; many competing vendors for similar products
- Just-in-place configuration: scalable up and down; flexible in configuration
- Rapid tracking of technology advances: first to exploit the newest component types
- Programmable: industry-standard programming languages and tools
- User empowerment

48 Key Parameters for Cluster Computing
- Peak floating-point performance
- Sustained floating-point performance
- Main memory capacity
- Bi-section bandwidth
- I/O bandwidth
- Secondary storage capacity
- Organization: processor architecture; # processors per node; # nodes; accelerators; network topology
- Logistical issues: power consumption; HVAC/cooling; floor space (sq. ft.)

49 Where's the Parallelism?
- Inter-node: multiple nodes; the primary level for commodity clusters; the secondary level for constellations
- Multi-socket, intra-node: routinely 1, 2, 4, or 8 sockets; heterogeneous computing with accelerators
- Multi-core, intra-socket: 2 or 4 cores per socket
- Multi-thread, intra-core: usually none or two
- ILP, intra-core: multiple operations issued per instruction; out-of-order execution with reservation stations; prefetching
- Accelerators

50 Topics
- Overview of Supercomputer Architecture
- Enabling Technologies
- SMP Memory Hierarchy
- Commodity Clusters
- System Area Networks

51 Fast and Gigabit Ethernet
- Cost effective (Lucent, 3Com, Cisco, etc.); directly leverages LAN technology and market
- Many ports per switch; switches can be stacked or connected with multiple gigabit links
- 100Base-T: bandwidth > 11 MB/s; latency < 90 microseconds
- 1000Base-T: bandwidth ~50 MB/s; latency < 90 microseconds
- 10 GbE: bandwidth 1250 MB/s; latency ~2.5 microseconds
(See the transfer-time sketch below.)
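Latency and bandwidth combine in the first-order message-time model t = latency + size/bandwidth; a sketch plugging in the figures quoted above (the 8 KB message size is an arbitrary choice):

```c
#include <stdio.h>

/* First-order message time: t = latency + size / bandwidth. */
static double msg_time(double latency_s, double bw_bytes_per_s, double bytes) {
    return latency_s + bytes / bw_bytes_per_s;
}

int main(void) {
    double kb8 = 8.0 * 1024.0;
    /* Figures from the slide: 1000Base-T ~50 MB/s at <90 us;
       10 GbE 1250 MB/s at ~2.5 us. */
    printf("8 KB on 1000Base-T: %.1f us\n",
           msg_time(90e-6, 50e6, kb8) * 1e6);
    printf("8 KB on 10 GbE:     %.1f us\n",
           msg_time(2.5e-6, 1250e6, kb8) * 1e6);
    return 0;
}
```

Short messages are dominated by the fixed latency term and long messages by the bandwidth term, which is why the slide quotes both numbers for each network.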

52 InfiniBand
- High performance: multi-Gbps links
- Low latency: ~1.2 microseconds
- Copper interconnects
- High availability: IEEE 802.3ad link aggregation / channel bonding

53 Dell PowerEdge SC1435 (Opteron, IBA)

54 Network Interconnect Topologies (diagrams: fat-tree (Clos) and torus)

55 Example: 320-host Clos topology of 16-port switches (diagram: five groups of 64 hosts each; from Myricom)

56 Arete InfiniBand Network (diagram)
