Computer Architecture. R. Poss


1 Computer Architecture R. Poss 1 ca01-10 september 2015

2 Course & organization

3 Aims of this course
The aims of this course are:
- to highlight current trends
- to introduce the notion of hardware/software interface
- to introduce caching and parallelism in computer architecture
- to introduce simulators & architecture models

4 Bibliography
Useful texts are:
- Hennessy & Patterson, Computer Architecture - A Quantitative Approach, 4th Edition
- Silc, Robic and Ungerer, Processor Architecture, Springer
- D. Sima, T. Fountain and P. Kacsuk, Advanced Computer Architecture: A Design Space Approach (Addison-Wesley)
- Your own search-fu -- use Google Scholar!

5 Course overview
Session 1:
1. Trends
2. Hw/Sw interface
3. Pipelining
Session 2:
1. Memory and caching
2. Multi-cores and HMT

6 What is computer architecture?

7 An engineering domain
Layers of composition and complexity, from parts to whole: refined matter, electronics, logic circuits, components, platforms, systems
[Figure: the layers grouped by discipline]
- ELECTRONIC ENGINEERING: petro-chemicals, packaging, backplanes, processors, metals, metal oxides, GaAs/Si/SiGe/SiC crystals, semiconductors, magnetic substrates, CMOS, NMOS
- COMPUTER ARCHITECTURE (you are here): links, functional units, storage, memories, caches, networks
- SOFTWARE ENGINEERING: algorithms, frameworks, operating software
- SYSTEMS ARCHITECTURE: computing platforms, software programs, computational clusters, embedded systems, personal computers, game consoles
The hidden partner activity: compilers, operating systems

8 Trends

9 The Big Change
Power vs chip area:
- before: power free, transistors expensive
- now: power expensive, transistors cheap
Storage vs computation:
- before: computation slow, storage fast
- now: storage slow, computation fast
Computation vs storage cost:
- before: small storage, ok to compute more to save space (e.g. compression)
- now: large storage, expensive to compute more

10 Moore's law
The number of transistors on integrated circuits doubles approximately every two years
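The doubling rule can be sketched as a small calculation. This is only an illustration of the stated trend: the function name, the two-year default and the starting count of one million transistors are assumptions made for the example, not data from the slides.

```python
def projected_transistors(initial_count: float, years: float,
                          doubling_period: float = 2.0) -> float:
    """Transistor count after `years`, assuming one doubling every
    `doubling_period` years (the slide's 'approximately every two years')."""
    return initial_count * 2 ** (years / doubling_period)


if __name__ == "__main__":
    start = 1_000_000  # hypothetical starting count, not a real chip
    for years in (2, 10, 20):
        count = projected_transistors(start, years)
        print(f"after {years:2d} years: {count:,.0f} transistors")
```

Under these assumptions, 20 years means 10 doublings, i.e. a factor of 1024.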

11 Moore's law
Why/how:
- CMOS: logic based on semiconductor gates in silicon
- DRAM: single-gate memory cells
- laser photolithography to sculpt gates at atomic scale
Fundamental limits:
- can't make CMOS smaller than atoms in silicon
- difficult to increase precision of lasers in manufacturing
Probable evolutions:
- number of transistors per unit of area in silicon will stabilize
- likely: larger chips + 3D designs with multiple layers of silicon (more area)

12 Latency lags bandwidth
Improvements over ca. 20 years:

             Latency   Bandwidth   Transistors
Processors   /30       x3000       x
Networks     /20       x1000
Memory       /4        x200        x
Disks        /10       x

13 Latency lags bandwidth
Moore's law helps bandwidth more than latency:
- More transistors + more pins = more bandwidth
Distance limits latency, storage capacity increases distance:
- More transistors = relatively longer lines
- We will study this later in the context of memories
Market bias: bandwidth easier to sell, so more investment there

14 Latency lags bandwidth
Latency helps bandwidth, but not the other way around:
- e.g. faster disk spin rate: shorter access times, more requests per second
- but: more disks in parallel = more bandwidth, same latency
Bandwidth hurts latency:
- Queues help bandwidth but hurt latency (queuing theory)
- adding parallelism actually increases latency (cf. later lecture)

15 Latency lags bandwidth
Summary:
- For a single component, bandwidth improves by at least the square of the improvement in latency
- Parallelism allows bandwidth to scale arbitrarily, but keeps latency constant or increases it
- Similar ratios for performance vs execution time
- These trends are here to stay
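As a rough check of the square rule, the sketch below plugs in the improvement factors that survive in the "Latency lags bandwidth" table (processors, networks, memory). The dictionary and function names are made up for this example, and the disk row is left out because its bandwidth figure is missing.

```python
# Improvement factors over ca. 20 years, taken from the slide's table:
# (latency divided by, bandwidth multiplied by). The disk row is omitted
# because its bandwidth value did not survive.
IMPROVEMENTS = {
    "Processors": (30, 3000),
    "Networks": (20, 1000),
    "Memory": (4, 200),
}


def square_rule_ratio(latency_gain: float, bandwidth_gain: float) -> float:
    """Bandwidth gain divided by the square of the latency gain.

    A value of 1.0 means bandwidth improved exactly as the square of the
    latency improvement; above 1.0 means it improved even more.
    """
    return bandwidth_gain / latency_gain ** 2


if __name__ == "__main__":
    for name, (latency_gain, bandwidth_gain) in IMPROVEMENTS.items():
        ratio = square_rule_ratio(latency_gain, bandwidth_gain)
        print(f"{name:10s} latency^2 = {latency_gain ** 2:4d}, "
              f"bandwidth = {bandwidth_gain:4d}, ratio = {ratio:.2f}")
```

For these three rows the ratio is above 1, so the square of the latency improvement behaves more like a lower bound on the bandwidth improvement than an exact prediction.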

16 Trends - the free lunch is over
[Figure: the story of uniprocessor performance, annotated with the power wall and the memory wall + seq. performance wall. Source: http://www.gotw.]

17 The Big Change (cont.)
A dramatic change in processor chips:
- Memory wall: processors are much faster than memories
- Power wall: can't power all transistors, or the chip will fry
- Sequential performance wall: more transistors don't help sequential performance any more
This course will suggest how we got here, why these problems are happening and what we can do about it

18 Principles of Comp. Arch.

19 Contribution of Comp. Arch.
Quantitative principles of design:
- Take advantage of parallelism
- Principle of locality
- Focus on the common case
- Amdahl's laws
Careful, quantitative comparison: define, quantify, summarize
Anticipating and exploiting advances in technology
Well-defined interfaces, carefully implemented and thoroughly defined

20 Parallelism
Three main strategies:
- Increase bandwidth and throughput by duplicating storage and data paths
- Use pipelining, i.e. an assembly line
- Perform operations out of order, including simultaneously
Fundamental limits:
- pipeline hazards
- time and data dependencies = mandatory order
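The assembly-line idea can be illustrated with the usual idealized cycle count. This is a minimal sketch under assumptions the slide does not state: every stage takes one cycle, one instruction enters the pipeline per cycle, and hazards are modelled only as a lump sum of extra stall cycles. The function names are invented for the example.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int,
                     stall_cycles: int = 0) -> int:
    """Idealized pipeline: the first instruction needs n_stages cycles to
    fill the pipeline, then one instruction completes per cycle.
    Hazards are modelled only as extra stall cycles."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical: 100 instructions, 5-stage pipeline
    base = unpipelined_cycles(n, k)
    for stalls in (0, 20):
        piped = pipelined_cycles(n, k, stalls)
        print(f"stalls={stalls:2d}: {base} vs {piped} cycles, "
              f"speedup {base / piped:.2f}")
```

With no stalls the speedup approaches the number of stages as the instruction count grows. Every stall cycle caused by a hazard or dependency eats into that gain, which is the "mandatory order" limit named on the slide.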

21 Locality
Principle: individual programs access a relatively small portion of memory in a small amount of time
Two different types:
- Temporal locality: if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality: if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Caches are a fundamental mechanism to take advantage of locality
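To make the link between locality and caches concrete, here is a toy direct-mapped cache model. It is a sketch, not a model of any real processor: the cache geometry (64 lines of 16 bytes), the 4-byte word size and the three access patterns are all assumptions chosen for the example.

```python
def hit_rate(addresses, n_lines: int = 64, line_size: int = 16) -> float:
    """Hit rate of a toy direct-mapped cache for a sequence of byte addresses.

    Each address maps to exactly one cache line; on a miss the whole
    line-sized block is loaded and replaces whatever was there before.
    """
    stored_block = [None] * n_lines  # which memory block each line holds
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size  # memory block containing this byte
        index = block % n_lines    # the cache line this block maps to
        total += 1
        if stored_block[index] == block:
            hits += 1
        else:
            stored_block[index] = block
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Spatial locality: consecutive 4-byte words, four per 16-byte line.
    sequential = list(range(0, 4096, 4))
    # No spatial locality: each access touches a new line and never returns.
    strided = list(range(0, 4096, 16))
    # Temporal locality: a small 256-byte array traversed ten times.
    reused = list(range(0, 256, 4)) * 10

    print("sequential:", hit_rate(sequential))
    print("strided:   ", hit_rate(strided))
    print("reused:    ", hit_rate(reused))
```

In this model the sequential walk hits on three of every four accesses purely through spatial locality, the strided walk never hits, and the repeated traversal misses only on its first pass.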

22-24 Memory hierarchy and latency
- Registers - on-chip SRAM
- L1 cache - on-chip SRAM
- L2 cache - on-chip SRAM or off-chip SRAM
- L3 cache - off-chip SRAM
- Main memory - DRAM
- Distributed memory
[Figure, built up over three slides: the hierarchy plotted against 1/cycle time (from 100 MHz to GHz), access latency in clocks (surviving values: < 1, 1-2, 2-6, 4-8) and size]

25 Focus on the common case
In making a design trade-off, favor the frequent case over the infrequent case:
- E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
- E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
The frequent case is often simpler and can be done faster than the infrequent case
What is the frequent case, and how much does performance improve by making that case faster? => Amdahl's law

26 Amdahl's law on speedup
Consider a computation P which contains two parts A and B in sequence:
- A can be enhanced (e.g. more parallelism, more performance); B cannot
- T(P) = T(A) + T(B) (T = time to complete)
Imagine we can accelerate A infinitely so that T(A) becomes 0
Intuitively: overall speedup is limited by T(B)
If A accounts for a proportion P[A/B] of the total work, and A can be accelerated by a factor S_A (speedup), Amdahl's law says:
S_overall = 1 / ( (1 - P[A/B]) + (P[A/B] / S_A) )
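The formula follows from the time decomposition on this slide. The derivation below is a sketch that assumes P[A/B] means the fraction of the original running time spent in A; it is written p here for brevity, and T'(P) denotes the running time after A has been accelerated.

```latex
\begin{align*}
T(P) &= T(A) + T(B), \qquad p = \frac{T(A)}{T(P)} \\
T'(P) &= \frac{T(A)}{S_A} + T(B)
       = T(P)\left(\frac{p}{S_A} + (1 - p)\right) \\
S_{\text{overall}} &= \frac{T(P)}{T'(P)}
       = \frac{1}{(1 - p) + \dfrac{p}{S_A}} \\
\lim_{S_A \to \infty} S_{\text{overall}} &= \frac{1}{1 - p} = \frac{T(P)}{T(B)}
\end{align*}
```

The last line is the intuition stated on the slide: with A accelerated infinitely, the remaining time is T(B), so the speedup cannot exceed T(P)/T(B).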

27 Amdahl's law example
An algorithm contains a sequential section and a parallel section:
- The parallel section contains 20% of the computation steps (P = 0.2)
- The parallel section can be accelerated by a factor N by using N processors/cores
Maximum speedup with N cores = 1 / ( (1 - 0.2) + (0.2 / N) )
With N = 100, speedup = 1.25X (100 cores, yet only 25% perf increase!)
This is the fundamental limit to parallelism: to maximize performance gains, first increase the proportion of the parallel section.
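The example can be reproduced with a few lines of code. This is a minimal sketch: the function name and the extra parallel fraction of 0.9, added for comparison, are assumptions for illustration and do not come from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: float) -> float:
    """Overall speedup when `parallel_fraction` of the work is accelerated
    by a factor `n_cores` and the rest stays sequential."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


if __name__ == "__main__":
    for p in (0.2, 0.9):  # 0.2 is the slide's case; 0.9 is for contrast
        limit = 1.0 / (1.0 - p)  # speedup with infinitely many cores
        for n in (2, 10, 100):
            speedup = amdahl_speedup(p, n)
            print(f"P={p}, N={n:3d}: speedup {speedup:.3f}")
        print(f"P={p}: upper bound {limit:.2f}")
```

With P = 0.2 and N = 100 the result is about 1.247, just under the bound of 1.25 that holds for any number of cores. Raising the parallel fraction to 0.9 lifts the 100-core speedup to about 9.17, which is why the slide recommends enlarging the parallel section first.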

28 Amdahl's law on design
A balanced system design should provision 1 bit per second of external bandwidth for each potential instruction per second:
- Too little external bandwidth: I/O bound
- Too few instructions/second: compute-bound
Desktop computers are traditionally I/O bound
Mainframes are usually compute-bound
Multi-cores require a huge amount of bandwidth to stay balanced
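The balance rule can be turned into a quick back-of-the-envelope check. The sketch below applies the slide's 1 bit/s per instruction/s ratio; the function names and the sample figures (2 billion instructions per second per core, 8 cores) are hypothetical.

```python
def classify_balance(io_bits_per_second: float,
                     instructions_per_second: float) -> str:
    """Classify a system with the 1 bit/s per instruction/s rule."""
    ratio = io_bits_per_second / instructions_per_second
    if ratio < 1.0:
        return "I/O bound"      # too little external bandwidth
    if ratio > 1.0:
        return "compute-bound"  # too few instructions per second
    return "balanced"


def balanced_bandwidth_bits(instructions_per_second_per_core: float,
                            n_cores: int) -> float:
    """External bandwidth (bits/s) a multi-core needs to stay balanced."""
    return instructions_per_second_per_core * n_cores


if __name__ == "__main__":
    per_core_ips = 2e9  # hypothetical per-core instruction rate
    print(classify_balance(1e9, per_core_ips))
    print(balanced_bandwidth_bits(per_core_ips, 8))
```

Under this rule the required bandwidth grows linearly with the core count, which is one way to read the slide's remark about multi-cores.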

29 The Hw/Sw interface

30 How do programs run?
General computer model: processor + memory + interconnect + I/O devices
Software is just bits, and so is data
How does software translate into behavior, i.e. communication, computation and control?
Your take here

31 System initialization
What happens when you switch the computer on?
Define/explain the relationships between:
- Reset signal
- Initial program counter
- Boot ROM
- Boot code
- Operating system
- Start-up storage
- Disks
- I/O interface
- CPU
- ROM
- RAM

32 Why simulators? Why not native?
This course will talk about the processor(s) in your desktop/laptop machines
But all x86 processors are really RISC under the hood
- More useful to study RISC to understand the main problems
Also: for your lab assignments you will study low-level architecture behavior under control of assembly code
- Easier with a simulator than real hardware!

33 Summary
What you can take away:
- What computer architecture is and why it is important
- Some general principles of comp. arch. science
- Intro to the hardware/software interface
