Enabling Technologies

1 High Performance Computing: Concepts, Methods, & Means. Enabling Technologies. Prof. Thomas Sterling, Department of Computer Science, Louisiana State University. March 13th, 2007

2 Topics: Introduction; Taxonomy of Technologies; State of the Art; Technology Trends; Summary Material for Test

4 Space of Consideration
Technology eras: Vacuum tubes; Transistors; LSI
Memory: 4KB (Williams tube); 1MB (magnetic core); 4GB (DRAM)
Logic: 10kHz (6V6); 1MHz (silicon); 3GHz (CMOS)
Communication: Pulse mode logic; omnibus BUS; Busses, Bridges, System LAN

5 Why do WE care? Speed; Density; Balance; Power, size, cost; Architecture; Operations; Configuration, total system size

6 A Growth-Factor of a Billion in Performance in a Single Lifetime. [Timeline figure. Performance scale: One OPS, KiloOPS, MegaOPS, GigaOPS, TeraOPS, PetaOPS. Upper row of machines: 1949 Edsac; 1959 IBM; Cray; Intel Delta; 1996 T3E; 2003 Cray X. Lower row of machines: 1823 Babbage Difference Engine; 1943 Harvard Mark; Univac; CDC; Cray XMP; 1988 Cray YMP; 1997 ASCI Red; 2001 Earth Simulator]

8 Current Technologies & Metrics
Memory (DRAMs): access times, bandwidth, capacity, size
Microprocessors: clock rate, instructions per cycle (ILP), power
I/O channels: bandwidth, latency
Disks: access times, bandwidth, capacity

9 SMP Node Diagram. [Figure: four MPUs with L1, L2, and L3 caches, connected through a controller to memory banks M1, M2, ... Mn-1, to NICs and storage (S), and to PCI-e, JTAG, Ethernet, USB, and peripheral interfaces.] Legend: MPU: MicroProcessor Unit; L1, L2, L3: Caches; M1..: Memory Banks; S: Storage; NIC: Network Interface Card

10 Memory - Overview. Memory is a temporary storage location used to store instructions and data. Instructions are the actual operations executed by the processor. Data is used and produced by peripherals such as hard disk or network controllers, and also includes intermediate results from program execution. Both instructions and data are required by the processor to compute meaningful results. The processor is constantly issuing commands to load and store data from memory across the memory bus. Because of these constant memory accesses, the large gap between the processor clock rate and the memory bus speed is one of the largest impediments to achieving theoretical peak performance. [Image: DDR2 PC]

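11 Another View of the Memory Hierarchy. Levels, from the upper level (faster) to the lower level (larger): Regs; Cache; L2 Cache; Memory; Disk; Tape. Units moved between adjacent levels: instruction operands (Regs and Cache); blocks (Cache and L2 Cache); blocks (L2 Cache and Memory); pages (Memory and Disk); files (Disk and Tape).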

12 Memory - Overview. Memory bus performance is characterized by two metrics. Memory bandwidth: the burst rate at which data can be copied between the DRAM memory chips and the CPU (total number of accesses per unit time), e.g. current rates range up to 6.4 GB/s for DDR2 PC. Memory latency: the amount of time it takes to move data between RAM and the CPU, e.g. current latencies range up to 80.5 ns for DDR2 PC. Many applications depend on the availability of entire datasets in RAM. Disk storage could be used instead, but this usually entails performance penalties due to higher access and retrieval times. Memory therefore becomes a crucial factor in system design and determines the size of the problem that can be run on the system. The usual rule of thumb is 1 byte of RAM for every floating-point operation per second of peak performance (actual requirements vary on a case-by-case basis).
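The two metrics above can be combined into a first-order estimate of how long a transfer takes: a fixed latency plus the payload size divided by the bandwidth. The sketch below assumes that simple linear model, which the slides do not state, and plugs in the 6.4 GB/s and 80.5 ns figures quoted above. The function name and the 64-byte cache line size are illustrative assumptions.

```python
def transfer_time_ns(num_bytes, bandwidth_gb_s=6.4, latency_ns=80.5):
    """First-order estimate: fixed access latency plus streaming time.

    Assumes a simple linear model (latency + size / bandwidth); real
    memory systems overlap and pipeline requests, so treat the result
    as a rough upper-level estimate only.
    """
    # 1 GB/s equals 1 byte per nanosecond, so bytes / (GB/s) yields ns.
    return latency_ns + num_bytes / bandwidth_gb_s


# One 64-byte cache line (assumed size): latency dominates.
small = transfer_time_ns(64)
# A 64 KiB block: streaming time dominates.
large = transfer_time_ns(64 * 1024)

print(f"64 B:   {small:.1f} ns")
print(f"64 KiB: {large:.1f} ns")
```

For small requests the latency term dominates, while for large blocks the bandwidth term does, which is why both numbers matter when sizing a memory system.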

13 Magnetic Core Memory

14 2nd Generation: Transistors. Replaced vacuum tubes. Smaller & cheaper. Less heat dissipation. Solid-state device (silicon). Invented 1947 at Bell Labs. [Image: The First Transistor]

15 Integrated Circuit Costs

$$\text{Cost of die} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$

where die yield is the fraction of good dies on the wafer.
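As a small worked instance of the die cost relation above, the sketch below divides the wafer cost by the number of good dies. The wafer cost, die count, and yield are made-up illustrative numbers, not figures from the lecture, and the function name is hypothetical.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost of one good die: wafer cost spread over the good dies only."""
    if not 0 < die_yield <= 1:
        raise ValueError("die_yield must be a fraction in (0, 1]")
    good_dies = dies_per_wafer * die_yield
    return wafer_cost / good_dies


# Hypothetical numbers: a $5000 wafer, 200 dies, half of them good.
cost = die_cost(wafer_cost=5000.0, dies_per_wafer=200, die_yield=0.5)
print(f"cost per good die: ${cost:.2f}")
```

Halving the yield doubles the cost per good die, which is why yield matters as much as wafer cost.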

16 1-Transistor Memory Cell (DRAM)
Write: 1. Drive bit line. 2. Select row.
Read: 1. Precharge bit line to Vdd. 2. Select row. 3. Cell and bit line share charges (very small voltage changes on the bit line). 4. Sense (fancy sense amp; can detect changes of ~1 million electrons). 5. Write: restore the value.
Refresh: 1. Just do a dummy read to every cell.
[Figure labels: row select, bit]

17 Classical DRAM Organization (square). [Figure: a row decoder, driven by the row address, asserts one word (row) select line of the RAM cell array; the bit (data) lines feed the column selector & I/O circuits, driven by the column address, which deliver the data.] Each intersection represents a 1-T DRAM cell. Row and column address together select 1 bit at a time.

18 DRAM Read Timing. Every DRAM access begins at the assertion of RAS_L. There are 2 ways to read: early or late v. CAS. [Figure: a 256K x 8 DRAM with control inputs RAS_L, CAS_L, WE_L, and OE_L, a 9-bit address input A, and an 8-bit data port D. Timing diagram over the DRAM read cycle time: A carries the row address, then the column address, then junk; D goes from high Z to data out after the read access time and the output enable delay.] Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L.
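The 256K x 8 part above has 2^18 locations but only 9 address pins, so the address is presented in two halves: a row address (latched by RAS_L) and a column address (latched by CAS_L). The sketch below shows one way to split an 18-bit address into those halves. Which bits go to the row and which to the column is an assumption made here for illustration; the slide does not specify the mapping.

```python
ADDRESS_PINS = 9  # from the slide: 9 multiplexed address lines


def split_address(addr):
    """Split an 18-bit cell address into (row, column) halves.

    Assumption: the upper 9 bits form the row address and the lower
    9 bits form the column address.
    """
    if not 0 <= addr < 1 << (2 * ADDRESS_PINS):
        raise ValueError("address out of range for a 256K-location part")
    row = addr >> ADDRESS_PINS
    col = addr & ((1 << ADDRESS_PINS) - 1)
    return row, col


row, col = split_address(0x2A5F3)
print(f"row={row}, col={col}")
```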

19 Static RAM Cell: 6-Transistor SRAM Cell
Write: 1. Drive bit lines (bit = 1, bit-bar = 0). 2. Select row.
Read: 1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal! 2. Select row. 3. Cell pulls one line low. 4. Sense amp on column detects difference between bit and bit-bar.
[Figure labels: word (row select), bit, bit-bar, stored values 0 and 1]

20 Typical SRAM Organization: 16-word x 4-bit. [Figure: an address decoder with inputs A0 to A3 drives word lines Word 0 through Word 15; each of the four bit columns has a write driver & precharger (inputs Din 0 to Din 3, WrEn, Precharge), a column of SRAM cells, and a sense amp producing Dout 0 to Dout 3.] Q: Which is longer: word line or bit line?

21 Microprocessor - Overview. The microprocessor is the single component that implements instruction execution. The lowest-level binary encoding of instructions and the actions they perform are dictated by the microprocessor instruction set architecture (ISA). The most common ISA used for a cluster node is the IA32 or x86_64 family, which includes all generations of the Pentium and Athlon processor families. A processor runs at a particular clock rate, i.e. it can execute instructions at a particular frequency, usually measured in megahertz or gigahertz. Note: a processor's clock rate is not a direct measure of its performance. Two processors with the same clock rate can perform differently for some tasks. [Image: Opteron]

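22 Computer Generations. Columns: Generation; Dates; Technology; Operations per Second. Technologies, in generation order: Vacuum Tube; Transistor; Small & Medium Scale Integration; Large Scale Integration (LSI); Very Large Scale Integration (VLSI). Operations per second as extracted, in the same order: 40, ; 200, ; 1,000,000; 10,000, ; ,000,000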

23 IBM 360 series (1964). Replaced (& not compatible with) 7000 series. First planned family of computers: similar or identical instruction sets; similar or identical O/S; increasing speed; increasing number of I/O ports (i.e. more terminals); increased memory size; increased cost. Multiplexed switch structure.

24 DEC PDP. First minicomputer (after miniskirt!). Did not need air conditioned room. Small enough to sit on a lab bench. $16,000, versus $100k+ for IBM 360. Embedded applications & OEM. BUS STRUCTURE.

25 DEC - PDP-8 Bus Structure. [Figure: Console Controller, CPU, Main Memory, and two I/O Modules attached to the OMNIBUS]

26 Microprocessor - Overview. Every processor has a theoretical peak speed, i.e. the maximum rate of instruction execution the processor can achieve. The theoretical peak performance (TPP) of a processor is determined by its clock rate, ISA, and the components included in the processor. TPP is measured in floating point operations per second, or flops. The current fastest supercomputer, BlueGene/L, has two commercial IBM PowerPC 440 microprocessors on each compute node, with a TPP of 2.8 GF/s each and 5.6 GF/s combined. Both the instructions and the data used by the processor are stored in memory. Memory usually runs at a much slower clock rate than the processor, hence the processor often waits for memory. The overall rate at which programs run is therefore usually a combination of three factors: the memory system performance, the processor's clock speed, and the number of operations issued per instruction.
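The TPP figures quoted above can be reproduced with a one-line product of clock rate, floating-point operations per cycle, and processor count. The 700 MHz clock and 4 flops per cycle used below are assumptions brought in for illustration (they are not given on the slide); they are chosen because they are consistent with the stated 2.8 GF/s per processor.

```python
def theoretical_peak_flops(clock_hz, flops_per_cycle, processors=1):
    """Theoretical peak: every floating-point unit busy on every cycle."""
    return clock_hz * flops_per_cycle * processors


# Assumed values, not from the slide: 700 MHz clock, 4 flops per cycle.
CLOCK_HZ = 700e6
FLOPS_PER_CYCLE = 4

per_cpu = theoretical_peak_flops(CLOCK_HZ, FLOPS_PER_CYCLE)
per_node = theoretical_peak_flops(CLOCK_HZ, FLOPS_PER_CYCLE, processors=2)

print(f"per processor: {per_cpu / 1e9:.1f} GF/s")
print(f"per node:      {per_node / 1e9:.1f} GF/s")
```

Sustained performance is normally well below this bound because, as noted above, the processor often waits for memory.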

27 Microprocessors - Overview. Delays introduced by constant memory accesses by the processor can be mitigated using the cache. The cache is a small amount of fast memory, usually co-located with the CPU. When data is accessed from memory, it is stored in the cache. Future repeated accesses to the same data can be served faster from the existing cached copies. Applications optimized to exploit these access patterns improve processor utilization, because the processor spends less time waiting for data and more time processing information. [Figure labels: fast and small versus slow and large]
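To make the effect of access patterns concrete, the sketch below simulates a tiny direct-mapped cache and counts hits for two address streams. The cache geometry (64 lines of 64 bytes), the direct-mapped policy, and the function name are all illustrative assumptions, not a model of any processor named in the slides.

```python
def hit_rate(addresses, num_lines=64, block_bytes=64):
    """Fraction of accesses served by a direct-mapped cache model."""
    tags = [None] * num_lines          # one stored block number per line
    hits = 0
    for addr in addresses:
        block = addr // block_bytes    # which memory block holds this byte
        index = block % num_lines      # the single line it may occupy
        if tags[index] == block:
            hits += 1                  # block already cached
        else:
            tags[index] = block        # miss: fetch and replace
    return hits / len(addresses)


# Consecutive 8-byte words: 8 words share each 64-byte block.
sequential_rate = hit_rate([i * 8 for i in range(4096)])
# One word every 4096 bytes: every access lands in a new block.
strided_rate = hit_rate([i * 4096 for i in range(4096)])

print(f"sequential: {sequential_rate:.3f}")
print(f"strided:    {strided_rate:.3f}")
```

In this model the sequential stream misses once per block and then hits on the remaining words, while the strided stream never reuses a cached block.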

28 Intel Microprocessor Performance

29 DRAM and Processor Characteristics

30 Processor-DRAM Memory Gap (latency). [Figure: performance versus time for CPU and DRAM.] µproc (Moore's Law): 60%/yr (2X/1.5 yr). DRAM: 9%/yr (2X/10 yrs). Processor-memory performance gap: grows 50% / year.
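The growth rates on this slide can be checked with a few lines of arithmetic. The sketch below assumes simple compound annual growth and derives the yearly growth of the gap and the doubling time implied by each rate; the helper name is illustrative.

```python
import math


def doubling_time_years(annual_rate):
    """Years needed to double at a compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


cpu_rate = 0.60    # processor improvement per year, from the slide
dram_rate = 0.09   # DRAM improvement per year, from the slide

# The gap grows by the ratio of the two yearly improvement factors.
gap_growth = (1 + cpu_rate) / (1 + dram_rate) - 1

cpu_doubling = doubling_time_years(cpu_rate)
dram_doubling = doubling_time_years(dram_rate)

print(f"gap growth per year: {gap_growth:.1%}")
print(f"CPU doubling time:   {cpu_doubling:.2f} years")
print(f"DRAM doubling time:  {dram_doubling:.2f} years")
```

Under this compounding assumption the gap grows a little under 50% per year and the processor doubles in roughly 1.5 years, in line with the slide; the DRAM doubling time comes out nearer 8 years, so the 2X/10 yrs figure reads as a rounded value.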

31 I/O Channels. I/O channels are buses that connect peripherals with main memory. Peripherals include disk and network controllers, USB, FireWire, etc. Each of these devices is connected to main memory via a bridge (usually referred to as the PCI chipset). Since I/O tasks are among the most common on computers, this subsystem is an integral part of any system. The most common I/O channel in commodity hardware is the PCI bus. Several flavors of PCI exist: PCI, PCI-X, and PCIe. [Images: PCI-X based Intel PRO/1000 Gigabit Ethernet adapter; M3F-PCIXD-2 Myrinet-Fiber/PCI-X interface; PCI-X slots on a motherboard; PCI-X 133 MHz card with two 4X Infiniband ports (10 Gb/sec each) and 256 MB memory]

32 I/O Channels: Motherboard. The motherboard provides the logical and physical infrastructure for integrating the subsystems of a cluster node and determines the set of components that may be used. Sockets and connectors on the motherboard include: microprocessor(s); memory; peripheral controllers (PCI-X); AGP port (graphics); power; external I/O for USB, keyboard, mouse, etc. Other chips on the motherboard provide: the system bus that links the processor(s) to memory; the interface between the peripheral buses and the system bus; programmable read-only memory (PROM) containing the BIOS software.

33 I/O Channels: Chipsets & BIOS. The chipset is the combination of all logic on the motherboard, including the memory bus, the PCI, PCI-X, and AGP bridges, disk controllers, USB controllers, etc. Chipsets can be split into two logical portions. North bridge: connects the front side bus (which connects the processor), the memory bus, and AGP. AGP is located on the north bridge so that it has special access to main memory. South bridge: contains the I/O bus bridges and any integrated peripherals that may be included, such as disk and USB controllers. The BIOS is the software that initializes all system hardware into a state from which the OS can boot. PXE (Preboot Execution Environment) is a system by which nodes can boot based on a network-provided configuration and boot image. Many new machines support this feature, and cluster management systems use it for installations. LinuxBIOS: a BIOS based on the Linux kernel that can perform all the important tasks needed for an OS to boot. Since the source code for this BIOS is available, firmware upgrades can be carried out more easily. These BIOSs also have faster boot times than conventional BIOSs.

34 Storage: Local Hard Disks. A hard drive contains several platters; data is read off these platters as they rotate. Logic in the drive optimizes read and write requests based on the geometry of the disks to provide better collective performance. The logic also contains a memory cache, which helps prevent the need for multiple reads of the same data. Hard disks are magnetic storage media that interface with some sort of storage bus. The three most commonly used storage buses are IDE (EIDE or ATA), SCSI, and Serial ATA. Controllers to manage these busses are integrated into most motherboards and can support up to 4 devices. UDMA133 is one such bus; it runs at a rate of 133 MB/s.

35 Storage - Locality. Often an application reads consecutive sectors. Most hard drives do read ahead: the disk has a buffer that stores the sectors after the one just read. It can be as large as 4MB. It's just a cache of sectors. The smarts in there are not well known due to proprietary technology. The buffer can also store sectors that need to be written to disk. Transfers to/from the buffer are at the speed of the I/O bus, not the magnetic device, and can be > 300MB/sec. More on the I/O bus later.

37 Memory Speeds and Trends. [Table: Technology; Speed; Module; Bandwidth (max theoretical), in GB/sec. Rows cover SDR, DDR, and DDR2 modules. Recoverable entries: DDR PC2700: 2.7 GB/sec; DDR PC3200: 3.2 GB/sec.] Source :

38 Columns: Memory; Size; Latency (ns); Bandwidth (MB/sec); Managed by
Registers: < 1KB; managed by Compiler
Cache: < 16MB; latency 0.5 (on-chip) - 25 (off-chip) ns; managed by Hardware
Main Memory: < 16GB; managed by O/S
Disk: > 100GB; latency 5,000, ns; managed by O/S

39 DRAM Implementations: DDR
DDR (Double Data Rate) memory: 2x64 bits transferred in a single bus cycle (at both clock edges). DDR-400 operates at a 200 MHz clock. The corresponding memory module is PC3200, delivering a peak bandwidth of 3.2 GB/s. Cycle time 5 ns, CAS latency 3. Module capacity: up to 4 GB. Features 2-bit wide prefetch buffers.
DDR2 memory: operates at twice the bus speed of DDR. DDR2-800 achieves 800 million transfers per second using a 400 MHz bus clock. The corresponding module PC has a peak bandwidth of 6.4 GB/s. Cycle time 2.5 ns, CAS latency 5. Module capacity: up to 4 GB. Features 4-bit wide prefetch buffers.
DDR3 (successor to DDR2) is currently sampling. Expected to achieve up to 1600 million transfers per second (12.8 GB/s per module) with an 800 MHz clock. Features 8-bit wide prefetch buffers.
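The peak module bandwidths quoted on this slide follow from the bus clock, two transfers per clock, and a 64-bit (8-byte) module data path. The sketch below reproduces them; the function name is illustrative, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
def module_bandwidth_gb_s(bus_clock_mhz, bus_width_bits=64, transfers_per_clock=2):
    """Peak module bandwidth: clock x transfers per clock x bytes per transfer."""
    bytes_per_transfer = bus_width_bits // 8
    bytes_per_second = bus_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer
    return bytes_per_second / 1e9


ddr_400 = module_bandwidth_gb_s(200)     # DDR-400, 200 MHz bus clock
ddr2_800 = module_bandwidth_gb_s(400)    # DDR2-800, 400 MHz bus clock
ddr3_1600 = module_bandwidth_gb_s(800)   # DDR3-1600, 800 MHz bus clock

for name, value in [("DDR-400", ddr_400), ("DDR2-800", ddr2_800), ("DDR3-1600", ddr3_1600)]:
    print(f"{name}: {value:.1f} GB/s")
```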

40 Other DRAM Implementations
XDR (eXtreme Data Rate) memory: based on Rambus DRAM technology. Eight bits per clock per lane ("Octal Data Rate"). One chip provides either 8 or 16 lanes. At a typical 400 MHz clock, the peak bandwidth is 6.4 GB/s per chip. Planned clock speeds are up to 1 GHz; currently the fastest parts run at 500 MHz. Current capacity: 512 Mbit per chip.
GDDR4 (Graphics Double Data Rate version 4) memory: 2.8 Gbit/s data rate per pin at a 1.4 GHz clock. 11.2 GB/s per chip with a 32-bit data bus. CAS latency of 18 clock cycles. Current capacity: 512 Mbit per chip. 8-bit prefetch buffer width.

41 Modern Processor Parameters. Columns: Processor; Clock Speed (#cores); Cache Sizes (per core); IPC (per core); Power
AMD Opteron: 2.8 GHz (2); L1: 64+64KB, L2: 1MB; 2 FP, 3 Integer; 119 W
IBM Power: GHz (2); L1: 64+32KB, L2: 1.875MB, L3: 18MB; 4 FP, 2 Integer; 70 W
Intel Itanium: GHz (2); L1: 16+16KB, L2: KB, L3: up to 12MB; 4 FP, 4 Integer; 104 W
Intel Xeon 7140M: 3.4 GHz (2); L1: 32+32KB, L2: 2MB; 4 FP, 3 Integer; 150 W
Sun UltraSparc T: 1.4 GHz (8); L1: 16+8KB, L2: 512KB; 1 FPU, 2 ALUs, crypto unit; 84 W

42 I/O Channel. PCI Express (3GIO): 16Gb/sec. HyperTransport (LDT): 41.6GB/sec @ 2.6GHz. PCI Bus, PCI-X Bus: 1GB/sec. AGP (Accelerated Graphics Port): 2.134GB/sec. Others: PCMCIA (Personal Computer Memory Card International Association); ISA Bus (Industry Standard Architecture); USB (Universal Serial Bus); RapidIO.

43 I/O Channels
PCI Express 1.1: 250 MB/s per lane. Card slots may include up to 32 lanes, for a peak rate of 8 GB/s.
PCI-X 2.0: 64 bits wide at 533 MHz. 4.3 GB/s throughput.
AGP 8x (Accelerated Graphics Port): 32-bit channel operating at 66 MHz (strobing 8 times per clock). Peak bandwidth of 2133 MB/s.
HyperTransport 3.0: up to 32 bits at 2.6 GHz, transmitted at both clock edges. Peak bandwidth 20,800 MB/s.
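Each peak figure on this slide is a product of channel width, clock rate, and transfers per clock (or, for PCI Express, lanes times the per-lane rate). The sketch below recomputes them under that assumption. The helper names are illustrative, and the AGP clock is taken as 200/3 MHz (about 66.67 MHz), which is an assumption about the nominal "66 MHz" quoted above.

```python
def parallel_bus_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak rate of a parallel channel in MB/s (decimal megabytes)."""
    return width_bits / 8 * clock_mhz * transfers_per_clock


def pcie_mb_s(lanes, per_lane_mb_s=250):
    """Peak rate of a PCI Express 1.1 link: lanes x per-lane rate."""
    return lanes * per_lane_mb_s


pcie_x32 = pcie_mb_s(32)                      # 32-lane slot
pci_x = parallel_bus_mb_s(64, 533)            # PCI-X 2.0
agp_8x = parallel_bus_mb_s(32, 200 / 3, 8)    # AGP 8x, clock assumed 66.67 MHz
ht3 = parallel_bus_mb_s(32, 2600, 2)          # HyperTransport 3.0, both edges

for name, rate in [("PCIe x32", pcie_x32), ("PCI-X 2.0", pci_x),
                   ("AGP 8x", agp_8x), ("HT 3.0", ht3)]:
    print(f"{name}: {rate:,.0f} MB/s")
```

These are peak signalling rates; protocol overhead and bus contention reduce what a device actually sees.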

44 Chipset I/O Capabilities. PCI Express: 56 lanes, 250 MB/s each; 12 links; 5 slots. SATA: 12 channels, 3 Gbps each. HyperTransport: 8 GB/s throughput to the CPU; supports up to 8 processors. Gigabit Ethernet: 4 MAC units. USB 2.0 ports: 10, at 480 Mbps each. Support of RAID 0, 1, 0+1, and 5. High Definition Audio (HDA): 8 channels, 192 kHz/32-bit quality.

46 HyperTransport (AMD). LDT: Lightning Data Transport. Aggregate bandwidth: 41.6 GB/s (HyperTransport 3.0). Point-to-point bus with [at least] two unidirectional links. Uses 2, 4, 8, 16, or 32 bits [in each direction]. The data rate is 800 MB/s per 8-bit pair with a 400 MHz clock. Bandwidth in both directions is 1.6 GB/s for 8-bit [bi-directional] pairs. 16-bit bi-directional pairs bring the data rate up to 3.2 GB/s aggregate (1.6 GB/s per direction). HT has a packet-based I/O link protocol specification. AMD motherboards use a bridge to communicate with PCI-X [high-end PCs] / PCI [desktops] buses. In HyperTransport there is an identical uni-directional link coming back from the far end. [Figure label: one uni-directional link]

47 Permanent Storage: Hard Disks. Storage capacity: 1 TB per drive. Areal density: 132 Gbit/in^2 (perpendicular recording). Rotational speed: 15,000 RPM. Average latency: 2 ms. Seek time: track-to-track 0.2 ms; average 3.5 ms; full stroke 6.7 ms. Sustained transfer rate: up to 125 MB/s. Non-recoverable error rate: 1 in
Interface bandwidth: Fibre Channel: 400 MB/s; Serially Attached SCSI (SAS): 300 MB/s; Ultra320 SCSI: 320 MB/s; Serial ATA (SATA): 300 MB/s.
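The 2 ms average latency above is the time for half a revolution at 15,000 RPM. The sketch below derives it and combines it with the average seek time and sustained transfer rate from the slide into a rough per-request access time. The additive model (seek + rotation + transfer), the function names, and the 125 KB request size are assumptions for illustration.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to arrive: half of one revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def avg_access_time_ms(rpm, avg_seek_ms, request_bytes, transfer_mb_s):
    """Rough per-request time: seek + rotational latency + transfer."""
    transfer_ms = request_bytes / (transfer_mb_s * 1e6) * 1e3
    return avg_seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms


latency = avg_rotational_latency_ms(15_000)
# Hypothetical 125 KB request at the slide's 3.5 ms seek and 125 MB/s.
access = avg_access_time_ms(15_000, 3.5, 125_000, 125)

print(f"average rotational latency: {latency:.1f} ms")
print(f"estimated access time:      {access:.1f} ms")
```

For small requests the mechanical terms (seek and rotation) dominate the transfer term, which is why sequential access and read-ahead buffers pay off.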

48 Storage: SATA & Overview. Serial ATA is the newest commodity hard disk standard. SATA uses serial buses, as opposed to the parallel buses used by ATA and SCSI. The cables attached to SATA drives are smaller and run faster (around 150 MB/s). The basic disk technologies remain the same across the three busses. The platters in a disk spin at a variety of speeds; the faster the platters spin, the faster data can be read off the disk, and the sooner data on the far end of the platter becomes available. Rotational speeds range from 5400 RPM to RPM. The faster the platters rotate, the lower the latency and the higher the bandwidth. [Image: PATA vs SATA]

49 Storage - RAID. RAID stands for Redundant Array of Inexpensive Disks. It provides a mechanism by which the performance and storage properties of individual disks can be aggregated. A group of disks appears to be a single large disk, and the performance of multiple disks is better than that of a single disk. Storing data in multiple places across the disks allows the system to continue functioning when a disk fails. Both software and hardware RAID solutions are available. Hardware solutions are more expensive, but provide better performance without CPU overhead. Software solutions provide various levels of flexibility but have associated computational overhead.

50 Storage - RAID Allocation. A variety of RAID allocation schemes exist.
RAID 0: Data is striped across multiple disks. The result of striping is a logical storage device that has the capacity of each disk times the number of disks present in the RAID array. Both read and write performance are accelerated. Each byte of data can be read from multiple locations, so interleaving reads between disks can help double read performance.
RAID 1: Complete copies of data are stored in multiple locations. The capacity of one of these RAID sets is half of its raw capacity. Read performance is accelerated and is comparable to RAID 0. Writes are slowed down, as new data needs to be transmitted multiple times.
RAID 5: Like RAID 0, data is striped across multiple disks, with one disk being dedicated to parity. For any block of data stored across the N-1 drives, the parity checksum is computed and stored on the last disk. Read performance of RAID 5 tends to be good, but write performance lags behind mirrors because of the checksum computation.
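The parity scheme described for RAID 5 can be illustrated with a bytewise XOR: the parity block is the XOR of the data blocks, and any single lost block is the XOR of the survivors and the parity. The sketch below demonstrates this, along with the usable capacities stated above. It uses XOR as the parity function, which is the conventional choice but is not named on the slide; the helper names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_capacity_gb(level, disks, disk_gb):
    """Usable capacity following the descriptions on the slide."""
    if level == 0:
        return disks * disk_gb           # striping: every disk holds data
    if level == 1:
        return disks * disk_gb / 2       # mirroring: half of raw capacity
    if level == 5:
        return (disks - 1) * disk_gb     # one disk's worth holds parity
    raise ValueError("unsupported RAID level in this sketch")


# Three data blocks striped over N-1 = 3 drives, parity on the fourth.
data = [b"HPC!", b"node", b"disk"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuilding its block.
recovered = xor_blocks([data[0], data[2], parity])

print(recovered)
print(usable_capacity_gb(5, disks=4, disk_gb=500))
```

The extra XOR on every write is the checksum computation that makes RAID 5 writes slower than mirrored writes.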

52 Feature Size Projections. [Figure: feature size (nm), log scale from 1 to 100, for DRAM 1/2 pitch, Flash 1/2 pitch, MPU physical gate length, MPU/ASIC M1 1/2 pitch, and MPU printed gate length.] Reduction factor: 0.88 per year, or 0.7 per 3 years.

53 Projected Density Growth (S^2). [Figure: density relative to 2004 for raw DRAM density, raw MPU density, and raw Flash density.] Basic area scaling doubles density every 3 years.
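The link between the feature size reduction factor and density growth can be checked numerically. The sketch below assumes that density scales with the inverse square of the linear feature size (the S^2 in the slide title) and uses the 0.7-per-3-years linear shrink quoted for feature size projections; the variable names are illustrative.

```python
# Linear feature size shrinks to 0.7x every 3 years.
linear_shrink_3yr = 0.7

# Equivalent annual factor: the cube root of the 3-year factor.
annual_shrink = linear_shrink_3yr ** (1 / 3)

# Linear scaling factor S, and the area (density) gain S^2.
S = 1 / linear_shrink_3yr
density_growth_3yr = S ** 2

print(f"annual linear shrink: {annual_shrink:.3f}")
print(f"density growth per 3 years: {density_growth_3yr:.2f}x")
```

The annual factor comes out near 0.88 and the density gain near 2x per 3 years, matching the "doubles every 3 years" annotation.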

54 Memory Density: Cells Only. [Figure: Mb/sq. cm (cells only), log scale, for DRAM, SLC Flash, MLC Flash, and SRAM; annotation: 24-30X]

55 Chip Capacity. [Figure: Gbits per chip; series: Classical Moore's Law, Historical, Production, Introduction.] Chip capacity is no longer following the original Moore's Law.

56 Classical DRAM. [Figure: Gbits per chip; series: Historical, SIA Production, SIA Introduction. Die components shown with % chip overhead: memory mats (~1 Mbit each); row decoders; primary sense amps; secondary sense amps & page multiplexing; timing, BIST, interface; kerf.] Density/chip has dropped below 4X/3 yrs, and 45% of the die is non-memory.

57 Growth in CPU Transistor Count

58 Logic Chip Density Scaling. [Figure: transistors per sq. cm (millions), log scale, for high volume MPUs, high performance MPUs, and ASICs.] Logic functions per unit area: ~2X every 3 years.

59 Peak Logic Clock Rates. [Figure: clock (MHz), log scale, plotted over time and against feature size; series: Classical Moore's Law, Historical, ITRS Max Clock Rate (12 inverters); marker at 3 GHz.] The 2005 projection was for 5.2 GHz and we didn't make it in production. Further, we're still stuck at 3+ GHz in production.

60 VLSI IC Technology. [Table headings: line width (nm); clock (GHz); DRAM cost (microcents/bit); MPU cost (microcents/transistor); supply voltage (V); wiring levels. Figure labels: cost per transistor; chip density]

61 DRAM Prices Source: Computer Architecture: A Quantitative Approach, 2nd Ed. by Hennessy & Patterson

62 Performance. Increasing the block size tends to decrease miss rate. [Figure: miss rate (0% to 40%) versus block size (bytes), one curve per cache size, including 1 KB, 8 KB, 16 KB, and 64 KB]

63 Performance Scaling. [Figure: Log2 speedup of single-processor performance versus technology node (65 nm, 45 nm, 32 nm, 22 nm). Annotations: 55%/year improvement; device speed; pipelining; RISC; RISC/CISC CPI; ILP wall; industry shifts to frequency dominated strategy; architectural frequency wall; conventional architectures cannot improve performance; assume successful 17%/year scaling; concurrency: new programming models needed?]

64 Linpack. [Figure: Linpack performance, log scale from 100 Mflops to 1 Zflops, for three series: SUM, N=1, and N=500. Spans 13.5 years. Growth over that span: No.1 machine 4700x; No.500 machine 6514x; sum of all machines 3150x. Courtesy of Thomas Sterling]

65 SIA ITRS Projections
Chip memory capacity: projects 32 Gigabits/chip by 2020; 45% of chip is non-memory; growth factor < 4X every 3 years.
Logic density: 2X every 3 years; factor of 25X.
Clock rate is uncertain: projects 70+ GHz by 2020; current projections not met; ~10X or 32 GHz.
Conclusions: technology alone insufficient; power consumption not considered; massive memory/logic imbalance; architecture must make up the difference.

67 Summary Material for the Test. Introduction: slides 4, 5. Taxonomy of Technologies: slide 8. State of the Art: slides 38-41, 43, 44, 47. Technology Trends: slide 60.
