NOW Handout Page 1
CS 258 Parallel Computer Architecture, Lecture 1: Introduction


Today's Goal: CS 258 Parallel Computer Architecture
- Introduce you to Parallel Computer Architecture
- Answer your questions about CS 258
- Provide you a sense of the trends that shape the field

CS 258, Spring 99
David E. Culler, Computer Science Division, U.C. Berkeley
1/19/99

What will you get out of CS258?
- In-depth understanding of the design and engineering of modern parallel computers
  » technology forces
  » fundamental architectural issues: naming, replication, communication, synchronization
  » basic design techniques: cache coherence, protocols, networks, pipelining, methods of evaluation
  » underlying engineering trade-offs
- From moderate to very large scale
- Across the hardware/software boundary

Will it be worthwhile?
- Absolutely! Even though few of you will become PP designers
- The fundamental issues and solutions translate across a wide spectrum of systems
  » crisp solutions in the context of parallel machines
- Pioneered at the thin end of the platform pyramid on the most demanding applications; they migrate downward with time
  » SuperServers -> Departmental Servers -> Workstations -> Personal Computers
- Understand the implications for software

Am I going to read my book to you? NO!
- Book provides a framework and complete background, so lectures can be more interactive
  » you do the reading
  » we'll discuss it
  » projects will go beyond

What is Parallel Architecture?
- A parallel computer is a collection of processing elements that cooperate to solve large problems fast
- Some broad issues:
  » resource allocation: how large a collection? how powerful are the elements? how much memory?
  » data access, communication and synchronization: how do the elements cooperate and communicate? how are data transmitted between processors? what are the abstractions and primitives for cooperation?
  » performance and scalability: how does it all translate into performance? how does it scale?

Why Study Parallel Architecture?
- Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost
- Parallelism:
  » provides an alternative to a faster clock for performance
  » applies at all levels of system design
  » is a fascinating perspective from which to view architecture
  » is increasingly central in information processing

Why Study it Today?
- History: diverse and innovative organizational structures, often tied to novel programming models
- Rapidly maturing under strong technological constraints
  » the "killer micro" is ubiquitous
  » laptops and supercomputers are fundamentally similar!
  » technological trends cause diverse approaches to converge
- Technological trends make parallel computing inevitable
- Need to understand fundamental principles and design tradeoffs, not just taxonomies
  » naming, ordering, replication, communication performance

Is Parallel Computing Inevitable?
- Application demands: our insatiable need for computing cycles
- Technology trends
- Architecture trends
- Economics
- Current trends:
  » today's microprocessors have multiprocessor support
  » servers and workstations are becoming MP: Sun, SGI, DEC, COMPAQ!...
  » tomorrow's microprocessors are multiprocessors

Application Trends
- Application demand for performance fuels advances in hardware, which enables new applications, which...
- This cycle drives exponential increase in microprocessor performance, and drives parallel architecture even harder
  » most demanding applications
- New applications demand more performance
- Range of performance demands: need a range of system performance with progressively increasing cost

Speedup
- Speedup (p processors) = Performance (p processors) / Performance (1 processor)
- For a fixed problem size (input data set), performance = 1/time
- Speedup, fixed problem (p processors) = Time (1 processor) / Time (p processors)

Commercial Computing
- Relies on parallelism for its high end
  » computational power determines the scale of business that can be handled
- Databases, online transaction processing, decision support, data mining, data warehousing...
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  » explicit scaling criteria provided
  » size of enterprise scales with size of system
  » problem size is not fixed as p increases
  » throughput is the performance measure (transactions per minute, or tpm)
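The two speedup views on these slides, fixed problem size versus scaled throughput, can be sketched as follows. This is an illustrative sketch only; the function names are mine, not from the course materials.

```python
def speedup(time_1proc, time_pproc):
    """Fixed-problem-size speedup: Time(1 processor) / Time(p processors)."""
    return time_1proc / time_pproc

def efficiency(s, p):
    """Speedup per processor; 1.0 means perfectly linear scaling."""
    return s / p

def throughput_speedup(tpm_pproc, tpm_1proc):
    """TPC-style view: ratio of throughputs (e.g. in tpm), used because the
    problem size grows with p, so raw time is not directly comparable."""
    return tpm_pproc / tpm_1proc

# A job that takes 100 s on one processor and 25 s on 8 processors:
s = speedup(100.0, 25.0)   # 4x faster
e = efficiency(s, 8)       # 0.5: only half of linear scaling
```

The distinction matters for the TPC benchmarks mentioned above: because the enterprise (and hence the database) scales with the machine, throughput rather than fixed-problem time is the honest measure.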

TPC-C Results for March 1996
[Chart: throughput (tpmC) versus number of processors for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and other systems]
- Parallelism is pervasive
- Small to moderate scale parallelism very important
- Difficult to obtain a snapshot to compare across vendor platforms

Scientific Computing Demand
[Chart: computational demands of large-scale scientific applications]

Engineering Computing Demand
- Large parallel machines are a mainstay in many industries:
  » petroleum (reservoir analysis)
  » automotive (crash simulation, drag analysis, combustion efficiency)
  » aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  » computer-aided design
  » pharmaceuticals (molecular modeling)
  » visualization: in all of the above; entertainment (films like Toy Story); architecture (walk-throughs and rendering)
  » financial modeling (yield and derivative analysis)
  » etc.

Applications: Speech and Image Processing
[Chart: processing demands from 1980 to 1995, spanning MIPS to GIPS: telephone number recognition, speaker verification, sub-band speech coding, CELP speech coding, isolated-word speech recognition, continuous speech recognition, ISDN-CD stereo receiver, CIF video, HDTV receiver]
- Also CAD, databases, ...
- 100 processors gets you 10 years of performance growth; 1000 gets you 20!

Is better parallel arch enough?
- AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray 90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

Summary of Application Trends
- Transition to parallel computing has occurred for scientific and engineering computing
- Rapid progress in commercial computing
  » database and transactions, as well as financial
  » usually smaller scale, but large-scale systems also used
- Desktop also uses multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
  » greatest use of small-scale multiprocessors
- Solid application demand exists and will increase

- - - Little break - - -

Technology Trends
[Chart: performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995]
- Today the natural building block is also the fastest!

Can't we just wait for it to get faster?
- Microprocessor performance increases 50%-100% per year
- Transistor count doubles every 3 years
- DRAM size quadruples every 3 years
- Huge investment per generation is carried by a huge commodity market
[Chart: integer and FP performance, 1987-1992, for the Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha]

Technology: A Closer Look
- Basic advance is decreasing feature size (λ)
  » circuits become either faster or lower in power
- Die size is growing too
  » clock rate improves roughly in proportion to the improvement in λ
  » number of transistors improves like λ² (or faster)
- Performance > 100x per decade; clock rate accounts for < 10x, the rest is transistor count
- How to use more transistors?
  » parallelism in processing: multiple operations per cycle reduces CPI
  » locality in data access: avoids latency and reduces CPI; also improves processor utilization
- Both need resources, so there is a tradeoff
- Fundamental issue is resource distribution, as in uniprocessors

Growth Rates
[Charts: clock rate (MHz) and transistor count, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, i80486, Pentium, and R10000]
- Clock rate: 30% per year
- Transistor count: 40% per year

Architectural Trends
- Architecture translates technology's gifts into performance and capability
- Resolves the tradeoff between parallelism and locality
  » current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  » tradeoffs may change with scale and technology advances
- Understanding microprocessor architectural trends
  » helps build intuition about design issues of parallel machines
  » shows the fundamental role of parallelism even in "sequential" computers
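The decade-scale claims above follow directly from compounding the quoted annual rates. A quick back-of-the-envelope check (assuming roughly 30%/year clock, 40%/year transistor count, and 50%/year microprocessor performance, as on these slides):

```python
def growth_factor(annual_rate, years=10):
    """Compound growth factor after `years` at `annual_rate` (0.4 = 40%/year)."""
    return (1.0 + annual_rate) ** years

clock_per_decade = growth_factor(0.30)        # ~13.8x from clock rate alone
transistors_per_decade = growth_factor(0.40)  # ~28.9x in transistor count
perf_low = growth_factor(0.50)                # ~57.7x at 50%/year
perf_high = growth_factor(1.00)               # 1024x at 100%/year
```

So performance growing well beyond what clock rate alone delivers is exactly what one expects once the extra transistors are spent on parallelism and locality, as the "Technology: A Closer Look" slide argues.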

Phases in VLSI Generation
[Chart: transistor count, 1970-2005, showing successive waves of bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?)]

Architectural Trends
- Greatest trend in VLSI generations is the increase in parallelism
- Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  » slows after 32-bit
  » adoption of 64-bit now under way, 128-bit far off (not a performance issue)
  » great inflection point when a 32-bit micro and cache fit on a chip
- Mid-80s to mid-90s: instruction-level parallelism
  » pipelining and simple instruction sets, plus compiler advances (RISC)
  » on-chip caches and functional units => superscalar execution
  » greater sophistication: out-of-order execution, speculation, prediction to deal with control transfer and latency problems
- Next step: thread-level parallelism

How far will ILP go?
[Chart: fraction of total cycles spent issuing 0 through 6+ instructions per cycle, and speedup versus instructions issued per cycle; speedup saturates around 3x]
- Assumes infinite resources and fetch bandwidth, perfect branch prediction and renaming
  » but real caches and non-zero miss latencies

Thread-Level Parallelism "on board"
[Diagram: several processors sharing a memory over a bus]
- Micro on a chip makes it natural to connect many to shared memory
  » dominates server and enterprise market, moving down to desktop
- Faster processors began to saturate the bus, then bus technology advanced
  » today, a range of sizes for bus-based systems, from desktop to large servers

What about Multiprocessor Trends?
[Chart: number of processors in fully configured commercial shared-memory systems]
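Why ILP speedup saturates, as in the limit study above, can be illustrated with a toy issue-width model. This is my own hypothetical sketch, not the study's methodology: unlimited renaming, perfect prediction, unit latency, and only the dependence structure and issue width limiting progress.

```python
def cycles(deps, width):
    """Cycles to issue a program on a `width`-issue machine.
    deps[i] lists the (earlier) instructions that instruction i depends on."""
    finish = [0] * len(deps)      # cycle by which each result is available
    issued = {}                   # cycle -> number of instructions issued then
    for i, producers in enumerate(deps):
        c = max((finish[j] for j in producers), default=0)  # operands ready
        while issued.get(c, 0) >= width:                    # find a free slot
            c += 1
        issued[c] = issued.get(c, 0) + 1
        finish[i] = c + 1         # unit execution latency
    return max(finish, default=0)

independent = [[] for _ in range(64)]              # 64 independent instructions
chain = [[i - 1] if i else [] for i in range(64)]  # one long dependence chain
```

Widening the machine from 1-issue to 8-issue cuts the independent block from 64 cycles to 8, but the dependence chain still takes 64 cycles at any width. Real programs are a mix, which is why measured speedup flattens out at modest issue widths.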
[Charts: number of processors (1984-1998) and shared-bus bandwidth in MB/s for bus-based shared-memory systems from Sequent, SGI, Sun, Cray, HP, DEC, and Intel, including the Sequent Balance and Symmetry, SGI PowerSeries, Challenge and PowerChallenge/XL, Sun SC2000 and E6000, Cray CS6400, AlphaServer 8400, and Pentium Pro systems]

What about Storage Trends?
- Divergence between memory capacity and speed is even more pronounced
  » capacity increased by 1000x from 1980-95, speed only 2x
  » gigabit DRAM by c. 2000, but the gap with processor speed is much greater
- Larger memories are slower, while processors get faster
  » need to transfer more data in parallel
  » need deeper cache hierarchies
  » how to organize caches?
- Parallelism increases the effective size of each level of the hierarchy, without increasing access time
- Parallelism and locality within memory systems too
  » new designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
  » buffer caches the most recently accessed data
- Disks too: parallel disks plus caching

Economics
- Commodity microprocessors are not only fast but CHEAP
  » development costs are tens of millions of dollars
  » BUT, many more are sold compared to supercomputers
  » crucial to take advantage of the investment, and use the commodity building block
- Multiprocessors are being pushed by software vendors (e.g. database) as well as hardware vendors
- Standardization makes small, bus-based SMPs a commodity
- Desktop: a few smaller processors versus one larger one?
  » multiprocessor on a chip?
Can we see some hard evidence?

Consider Scientific Supercomputing
- Proving ground and driver for innovative architecture and techniques
  » market smaller relative to commercial as MPs become mainstream
  » dominated by vector machines starting in the 70s
  » microprocessors have made huge gains in floating-point performance: high clock rates; pipelined floating point units (e.g., multiply-add every cycle); instruction-level parallelism; effective use of caches (e.g., automatic blocking)
  » plus economics
- Large-scale multiprocessors replace vector supercomputers

Raw Uniprocessor Performance: LINPACK
[Chart: LINPACK MFLOPS at n = 100 and n = 1000, 1975-1995, for Cray vector machines (Cray 1s, X-MP/14se, X-MP/416, Y-MP, C90, T94) and microprocessors (MIPS M/120, MIPS M/2000, Sun 4/260, IBM RS6000/540, MIPS R4400, HP 9000/750 and /735, DEC Alpha and Alpha AXP, IBM Power2/990, DEC 8200)]

Raw Parallel Performance: LINPACK
[Chart: LINPACK GFLOPS, 1985-1996, MPP peak versus Cray vector peak: X-MP/416, Y-MP/832, CM-2, iPSC/860, nCUBE/2, Delta, CM-5, T3D, Paragon XP/S and XP/S MP, T932(32), C90(16), ASCI Red]
- Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
- Since 1993, Cray produces MPPs too (T3D, T3E)

500 Fastest Computers
[Chart: number of systems among the 500 fastest computers by class (MPP, PVP, SMP), 11/93 to 11/96; MPP and SMP counts grow while PVP counts decline]

Summary: Why Parallel Architecture?
- Increasingly attractive
  » economics, technology, architecture, application demand
- Increasingly central and mainstream
- Parallelism exploited at many levels
  » instruction-level parallelism
  » multiprocessor servers
  » large-scale multiprocessors ("MPPs")
- Focus of this class: the multiprocessor level of parallelism
- Same story from the memory-system perspective
  » increase bandwidth, reduce average latency with many local memories
- Spectrum of parallel architectures makes sense
  » different cost, performance, and scalability

Where is Parallel Arch Going?
- Old view: divergent architectures, no predictable pattern of growth
  » systolic arrays, dataflow, shared memory, SIMD, message passing, each with its own application and system software
  » uncertainty of direction paralyzed parallel software development!
- Today: extension of "computer architecture" to support communication and cooperation
  » instruction set architecture plus communication architecture
- Defines:
  » critical abstractions, boundaries, and primitives (interfaces)
  » organizational structures that implement interfaces (hw or sw)
- Compilers, libraries and OS are important bridges today

Modern Layered Framework
- Parallel applications: CAD, database, scientific modeling, multiprogramming
- Programming models: shared address, message passing, data parallel
- Communication abstraction (user/system boundary): compilation or library; operating systems support
- Communication hardware (hardware/software boundary); physical communication medium

How will we spend our time?
http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html
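The shared-address and message-passing programming models in the layered framework can be contrasted with a small sketch. This is purely illustrative (Python threads stand in for processing elements; the function names are mine): in one model, cooperation happens through shared variables guarded by synchronization; in the other, through explicit sends and receives.

```python
import threading
import queue

def shared_address_sum(values):
    """Shared address space: workers update a shared total under a lock."""
    total = 0
    lock = threading.Lock()
    def worker(chunk):
        nonlocal total
        s = sum(chunk)
        with lock:                 # synchronization primitive guards sharing
            total += s
    mid = len(values) // 2
    ts = [threading.Thread(target=worker, args=(c,))
          for c in (values[:mid], values[mid:])]
    for t in ts: t.start()
    for t in ts: t.join()
    return total

def message_passing_sum(values):
    """Message passing: workers own their data and send results as messages."""
    mailbox = queue.Queue()
    def worker(chunk):
        mailbox.put(sum(chunk))    # explicit send
    mid = len(values) // 2
    ts = [threading.Thread(target=worker, args=(c,))
          for c in (values[:mid], values[mid:])]
    for t in ts: t.start()
    for t in ts: t.join()
    return mailbox.get() + mailbox.get()  # explicit receives
```

Both compute the same answer; what differs is the communication abstraction, which is exactly the boundary the layered framework places between programming models and hardware.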

How will grading work?
- 30% homeworks (6)
- 30% exam
- 30% project (teams of 2)
- 10% participation

Any other questions?