Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 1
About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. Behrooz Parhami
First edition released July 2003; revised July 2004, July 2005, Mar. 2007, and Feb. 2011*
* Minimal update, due to this part not being used for lectures in ECE 154 at UCSB
VII Advanced Architectures
Performance enhancement beyond what we have seen:
- What else can we do at the instruction execution level?
- Data parallelism: vector and array processing
- Control parallelism: parallel and distributed processing
Topics in This Part
Chapter 25 Road to Higher Performance
Chapter 26 Vector and Array Processing
Chapter 27 Shared-Memory Multiprocessing
Chapter 28 Distributed Multicomputing
25 Road to Higher Performance
Review of past, current, and future architectural trends:
- General-purpose and special-purpose acceleration
- Introduction to data and control parallelism
Topics in This Chapter
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing
25.1 Past and Current Performance Trends
Intel 4004, the first microprocessor (1971): 0.06 MIPS (4-bit processor)
Intel Pentium 4, circa 2005: 10,000 MIPS (32-bit processor)
Intermediate generations: 8-bit (8008, 8080, 8084); 16-bit (8086, 8088, 80186, 80188, 80286); 32-bit (80386, 80486, Pentium/MMX, Pentium Pro/II, Pentium III/M, Celeron)
Architectural Innovations for Improved Performance
Computer performance grew by a factor of about 10,000 between 1980 and 2000: a factor of 100 due to faster technology and a factor of 100 due to better architecture
Available computing power ca. 2000: GFLOPS on desktop; TFLOPS in supercomputer center; PFLOPS on drawing board
Architectural methods and their improvement factors (methods 1-5 are established and previously discussed; methods 6-10 are newer and covered in Part VII):
1. Pipelining (and superpipelining): 3-8
2. Cache memory, 2-3 levels: 2-5
3. RISC and related ideas: 2-3
4. Multiple instruction issue (superscalar): 2-3
5. ISA extensions (e.g., for multimedia): 1-3
6. Multithreading (super-, hyper-): 2-5?
7. Speculation and value prediction: 2-3?
8. Hardware acceleration: 2-10?
9. Vector and array processing: 2-10?
10. Parallel/distributed computing: 2-1000s?
Peak Performance of Supercomputers
Peak performance grew by a factor of 10 every 5 years, from GFLOPS (ca. 1980) through TFLOPS to PFLOPS (ca. 2010); machines along the trend include the Cray X-MP, Cray 2, TMC CM-2, Cray T3D, TMC CM-5, ASCI Red, ASCI White Pacific, and the Earth Simulator.
Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004 [Dong04]
25.2 Performance-Driven ISA Extensions
Adding instructions that do more work per cycle:
- Shift-add: replace two instructions with one (e.g., multiply by 5)
- Multiply-add: replace two instructions with one (x := c + a × b)
- Multiply-accumulate: reduce round-off error (s := s + a × b)
- Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications):
- Intel MMX: multimedia extension; 64-bit registers can hold multiple integer operands
- Intel SSE: streaming SIMD extension; 128-bit registers can hold several floating-point operands
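The subword-parallelism idea can be sketched in ordinary code. Below is a minimal Python illustration (not actual MMX code; lane count, width, and the PADDW-style wraparound behavior are modeled after MMX packed addition) of four independent 16-bit lanes packed into one 64-bit word:

```python
# Illustrative sketch of MMX-style subword parallelism: one 64-bit word
# holds four independent 16-bit lanes, and a "packed add" operates on
# all lanes at once, with per-lane wraparound (as in MMX PADDW).

LANES = 4
BITS = 16
MASK = (1 << BITS) - 1

def pack(values):
    """Pack four 16-bit integers into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * BITS)
    return word

def unpack(word):
    """Extract the four 16-bit lanes from a 64-bit word."""
    return [(word >> (i * BITS)) & MASK for i in range(LANES)]

def paddw(a, b):
    """Packed add: each lane is summed independently, wrapping mod 2^16."""
    result = 0
    for i in range(LANES):
        lane = (((a >> (i * BITS)) & MASK) + ((b >> (i * BITS)) & MASK)) & MASK
        result |= lane << (i * BITS)
    return result

a = pack([1, 2, 3, 0xFFFF])   # last lane will wrap around
b = pack([10, 20, 30, 1])
print(unpack(paddw(a, b)))    # [11, 22, 33, 0]
```

On real hardware the four lane additions happen in one ALU pass, with carries suppressed at lane boundaries; here the loop merely models that behavior.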
25.3 Instruction-Level Parallelism
Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91]. (Panel a plots the fraction of cycles, up to about 30%, against the number of issuable instructions per cycle; panel b plots the speedup attained, saturating near 3, against instruction issue width.)
Instruction-Level Parallelism
Figure 25.5 A computation with inherent instruction-level parallelism.
VLIW and EPIC Architectures
VLIW: very long instruction word architecture
EPIC: explicitly parallel instruction computing
Figure 25.6 Hardware organization for IA-64: multiple execution units served by general registers (128), floating-point registers (128), predicates (64), and memory. General and floating-point registers are 64 bits wide; predicates are single-bit registers.
25.4 Speculation and Value Prediction
Figure 25.7 Examples of software speculation in IA-64: (a) control speculation, where a load is hoisted as a spec load and later verified with a check load; (b) data speculation, where a load is hoisted above a store as a spec load and later verified with a check load.
Value Prediction
Figure 25.8 Value prediction for multiplication or division via a memo table: when the inputs are ready, they index both the memo table and the mult/div unit; on a hit, a mux selects the memoized output immediately, while on a miss the output becomes ready only when the mult/div unit is done.
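A behavioral sketch of the memo-table idea, in Python (the table capacity and the 4-cycle multiply latency are assumptions for illustration, not parameters from the figure):

```python
# Sketch of value prediction via a memo table (after Figure 25.8):
# a repeated operand pair hits in the table and bypasses the slow
# multiply unit; a miss pays the unit's latency and fills the table.

class MemoizedMultiplier:
    def __init__(self, capacity=256):       # capacity is an assumption
        self.capacity = capacity
        self.table = {}                     # (a, b) -> product
        self.slow_cycles = 0                # cycles spent in the real unit

    def multiply(self, a, b):
        key = (a, b)
        if key in self.table:               # hit: result ready at once
            return self.table[key]
        self.slow_cycles += 4               # miss: assumed 4-cycle latency
        product = a * b
        if len(self.table) < self.capacity:
            self.table[key] = product       # fill the memo table
        return product

m = MemoizedMultiplier()
m.multiply(6, 7)      # miss: uses the slow unit
m.multiply(6, 7)      # hit: bypasses the slow unit
print(m.slow_cycles)  # 4
```

The payoff depends on value locality: the same operand pairs must recur often enough that hits outnumber the table-fill overhead.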
25.5 Special-Purpose Hardware Accelerators
Figure 25.9 General structure of a processor with configurable hardware accelerators: a CPU with data and program memory, attached to an FPGA-like unit on which accelerators (Accel. 1, Accel. 2, Accel. 3, plus unused resources) can be formed via loading of configuration registers from a configuration memory.
Graphic Processors, Network Processors, etc.
Figure 25.10 Simplified block diagram of Toaster2, the Cisco Systems network processor: a 4 × 4 array of processing elements (PE 0 through PE 15) between an input buffer and an output buffer, with a column memory per column and a feedback path.
25.6 Vector, Array, and Parallel Processing
Flynn's categories (single vs. multiple instruction streams × single vs. multiple data streams):
- SISD: uniprocessors
- SIMD: array or vector processors
- MISD: rarely used
- MIMD: multiprocessors or multicomputers
Johnson's expansion of MIMD (global vs. distributed memory × shared variables vs. message passing):
- GMSV: shared-memory multiprocessors
- GMMP: rarely used
- DMSV: distributed shared memory
- DMMP: distributed-memory multicomputers
Figure 25.11 The Flynn-Johnson classification of computer systems.
SIMD Architectures
Data parallelism: executing one operation on multiple data streams
- Concurrency in time: vector processing
- Concurrency in space: array processing
Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering), y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing:
- One instruction is fetched and decoded for the entire operation
- The multiplications are known to be independent (no checking)
- Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar
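The filtering example above can be written out to contrast the two forms. A small Python sketch (the vector length and values are made up for illustration):

```python
# The data-parallel filtering example y[i] = c[i] * x[i]: scalar loop vs.
# the single vector operation a SIMD machine performs. On real hardware
# one vector instruction covers all n lanes at once.

n = 8
c = [1, 2, 3, 4, 5, 6, 7, 8]        # coefficient vector
x = [8, 7, 6, 5, 4, 3, 2, 1]        # data vector

# Scalar (SISD) form: one multiply per fetched and decoded instruction
y_scalar = []
for i in range(n):
    y_scalar.append(c[i] * x[i])

# Vector (SIMD) form: one operation over all elements; the n multiplies
# are independent, so no dependence checking is needed
y_vector = [ci * xi for ci, xi in zip(c, x)]

print(y_vector)   # [8, 14, 18, 20, 20, 18, 14, 8]
```

The scalar loop fetches and decodes a multiply (plus loop-control instructions) n times; the vector form stands in for a single instruction applied to all n element pairs.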
MISD Architecture Example
Figure 25.12 Multiple instruction streams (1-5) operating on a single data stream (MISD), with data flowing in at one end and out at the other.
MIMD Architectures
Control parallelism: executing several instruction streams in parallel
- GMSV: shared global memory (symmetric multiprocessors)
- DMSV: shared distributed memory (asymmetric multiprocessors)
- DMMP: message passing (multicomputers)
Figure 27.1 Centralized shared memory: processors reach memory modules through a processor-to-memory network, with a processor-to-processor network and parallel I/O.
Figure 28.1 Distributed memory: computing nodes, each with memories, processors, and a router, joined by an interconnection network with parallel input/output.
Amdahl's Law Revisited
f = sequential fraction of the task
p = speedup of the rest (e.g., with p processors)
s = 1 / (f + (1 - f)/p) ≤ min(p, 1/f)
Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 - f part runs p times as fast. (The curves plot speedup s against enhancement factor p, both up to 50, for f = 0, 0.01, 0.02, 0.05, and 0.1.)
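A quick numerical check of the formula makes the saturation visible:

```python
# Amdahl's law, s = 1 / (f + (1 - f)/p): speedup when a fraction f of a
# task is unaffected and the remaining 1 - f part runs p times as fast.

def amdahl_speedup(f, p):
    """Overall speedup for sequential fraction f and enhancement factor p."""
    return 1.0 / (f + (1.0 - f) / p)

# Even with p = 50, a 10% sequential fraction limits speedup to about 8.5;
# as p grows without bound, s approaches the 1/f = 10 ceiling.
print(round(amdahl_speedup(0.1, 50), 2))       # 8.47
print(round(amdahl_speedup(0.1, 10**6), 2))    # 10.0
```

This is why the curves in Figure 4.4 flatten out: for any fixed f > 0, adding enhancement beyond p ≈ 1/f yields rapidly diminishing returns.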
26 Vector and Array Processing
Single instruction stream operating on multiple data streams
- Data parallelism in time = vector processing
- Data parallelism in space = array processing
Topics in This Chapter
26.1 Operations on Vectors
26.2 Vector Processor Implementation
26.1 Operations on Vectors
Sequential processor:
for i = 0 to 63 do
  P[i] := W[i] × D[i]
endfor
Vector processor:
load W
load D
P := W × D
store P
Unparallelizable (loop-carried dependences):
for i = 0 to 63 do
  X[i+1] := X[i] + Z[i]
  Y[i+1] := X[i+1] + Y[i]
endfor
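Why is the second loop unparallelizable? Each iteration consumes a value produced by the previous one. A short Python sketch (with made-up initial values and a small n) makes the chain of dependences concrete:

```python
# The X/Y recurrence from the "unparallelizable" loop: X[i+1] depends on
# the X[i] computed in the previous iteration (a loop-carried dependence),
# so the iterations cannot be issued as one independent vector operation.

n = 4
X = [1, 0, 0, 0, 0]   # X[0] given; later entries produced by the loop
Y = [1, 0, 0, 0, 0]
Z = [1, 1, 1, 1]

for i in range(n):
    X[i + 1] = X[i] + Z[i]        # needs the X[i] just computed
    Y[i + 1] = X[i + 1] + Y[i]    # needs both X[i+1] and Y[i]

print(X)  # [1, 2, 3, 4, 5]
print(Y)  # [1, 3, 6, 10, 15]
```

Contrast this with P[i] := W[i] × D[i], where every element of P can be computed without looking at any other iteration's result.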
26.2 Vector Processor Implementation
Figure 26.1 Simplified generic structure of a vector processor: a vector register file, fed from scalar registers and connected to the memory unit through load units A and B and a store unit, drives function unit pipelines 1 through 3, with forwarding muxes among them.
Example Array Processor
Figure 26.7 Array processor with 2D torus interprocessor communication network: a control unit broadcasts to the processor array, with switches and parallel I/O at the array boundary.
27 Shared-Memory Multiprocessing
Multiple processors sharing a memory unit seems naïve
- Didn't we conclude that memory is the bottleneck?
- How then does it make sense to share the memory?
Topics in This Chapter
27.1 Centralized Shared Memory
27.2 Multiple Caches and Cache Coherence
27.3 Implementing Symmetric Multiprocessors
27.6 Implementing Asymmetric Multiprocessors
Parallel Processing as a Topic of Study
An important area of study that allows us to overcome fundamental speed limits
Our treatment of the topic is quite brief (Chapters 26-27)
Graduate course ECE 254B: Adv. Computer Architecture, Parallel Processing
27.1 Centralized Shared Memory
Figure 27.1 Structure of a multiprocessor with centralized shared memory: processors 0 through p-1 connected to memory modules 0 through m-1 by a processor-to-memory network, with a processor-to-processor network and parallel I/O.
27.2 Multiple Caches and Cache Coherence
(Same structure as Figure 27.1, with a private cache inserted between each processor and the processor-to-memory network.)
Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems.
27.3 Implementing Symmetric Multiprocessors
Figure 27.6 Structure of a generic bus-based symmetric multiprocessor: computing nodes (typically, 1-4 CPUs and caches per node), interleaved memory, and I/O modules with standard interfaces, all attached through bus adapters to a very wide, high-bandwidth bus.
27.6 Implementing Asymmetric Multiprocessors
Figure 27.11 Structure of a ring-based distributed-memory multiprocessor: computing nodes (typically, 1-4 CPUs and associated memory), Node 0 through Node 3, each linked into a ring network, with connections to I/O controllers.
28 Distributed Multicomputing
Computer architects' dream: connect computers like toy blocks
- Building multicomputers from loosely connected nodes
- Internode communication is done via message passing
Topics in This Chapter
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.6 Grid Computing and Beyond
28.1 Communication by Message Passing
Figure 28.1 Structure of a distributed multicomputer: computing nodes 0 through p-1, each consisting of memories, processors, a router, and parallel input/output, connected by an interconnection network.
Router Design
Figure 28.2 The structure of a generic router: input channels feed input message queues (Q) through link controllers (LC); a switch connects the input queues to output queues, which drive the output channels through their own link controllers; injection and ejection channels connect the local node, and a routing and arbitration unit controls the switch.
Interprocess Communication via Messages
Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes: process B executes receive x before process A executes send x, so B is suspended; after the communication latency, B is awakened with the message.
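The blocking-receive behavior in Figure 28.4 can be sketched with two Python threads and a queue standing in for the message channel (the sleep and the value 42 are arbitrary stand-ins for "A does other work" and "x"):

```python
# Send/receive synchronization in the style of Figure 28.4: the receiver
# blocks (is "suspended") on receive until the sender's message arrives,
# then is awakened with the value.

import queue
import threading
import time

channel = queue.Queue()   # the message channel between the two processes
results = []

def process_a():
    time.sleep(0.1)       # A does other work before sending
    channel.put(42)       # send x

def process_b():
    x = channel.get()     # receive x: blocks until the send happens
    results.append(x)     # B is awakened with the value

a = threading.Thread(target=process_a)
b = threading.Thread(target=process_b)
b.start()                 # B starts first and suspends on receive
a.start()
a.join()
b.join()
print(results)            # [42]
```

The communication latency in the figure corresponds here to the interval between `channel.put` and `channel.get` returning; on a real multicomputer it includes network traversal through the routers of Figure 28.2.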
28.2 Interconnection Networks
Figure 28.5 Examples of direct and indirect interconnection networks: (a) a direct network, in which each node's router links directly to other routers; (b) an indirect network, in which nodes communicate through a separate fabric of routers.
Direct Interconnection Networks
Figure 28.6 A sampling of common direct interconnection networks: (a) 2D torus, (b) 4D hypercube, (c) chordal ring, (d) ring of rings. Only routers are shown; a computing node is implicit for each router.
Indirect Interconnection Networks
Figure 28.7 Two commonly used indirect interconnection networks: (a) hierarchical buses (level-1, level-2, and level-3 buses), (b) Omega network.
28.3 Message Composition and Routing
Figure 28.8 Messages and their parts for message passing: a message is divided into packets, the last of which is padded to full size; each transmitted packet consists of a header, data or payload, and a trailer, and is in turn divided into flow-control digits (flits).
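A minimal sketch of the message-to-packet division, in Python. The payload size and the header and trailer fields (destination byte, sequence number, toy checksum) are illustrative assumptions, not fields from the figure:

```python
# Splitting a message into fixed-size packets with padding, in the spirit
# of Figure 28.8. Field layouts here are assumptions for illustration:
# header = (destination, sequence number), trailer = 1-byte checksum.

PAYLOAD_BYTES = 8

def packetize(message: bytes, dest: int):
    """Divide a message into header + payload + trailer packets."""
    packets = []
    for offset in range(0, len(message), PAYLOAD_BYTES):
        payload = message[offset:offset + PAYLOAD_BYTES]
        payload = payload.ljust(PAYLOAD_BYTES, b"\x00")   # pad last packet
        header = bytes([dest, offset // PAYLOAD_BYTES])   # dest + seq. no.
        trailer = bytes([sum(payload) % 256])             # toy checksum
        packets.append(header + payload + trailer)
    return packets

pkts = packetize(b"hello, multicomputer", dest=3)
print(len(pkts))   # 3 packets for a 20-byte message
```

Each packet here is 11 bytes (2 header + 8 payload + 1 trailer); on a real network each packet would be further divided into flits for flow control through the routers.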
Computing in the Cloud
Computational resources, both hardware and software, are provided by, and managed within, the cloud; users pay a fee for access.
Managing and upgrading the resources is much more efficient in large, centralized facilities (warehouse-sized data centers or server farms).
This is a natural continuation of the outsourcing trend for special services, allowing companies to focus their energies on their main business.
(Image from Wikipedia)
Warehouse-Sized Data Centers
(Image from IEEE Spectrum, June 2009)