CS 258 Parallel Computer Architecture, Lecture 2: Convergence of Parallel Architectures
CS 258 Parallel Computer Architecture
Lecture 2: Convergence of Parallel Architectures
January 28, 2008, Prof. John D. Kubiatowicz

Review
- Industry has decided that multiprocessing is the future/best use of transistors: every major chip manufacturer now making MultiCore chips
- History of microprocessor architecture is parallelism: translates area and density into performance
- The future is higher levels of parallelism: parallel architecture concepts apply at many levels; communication also on exponential curve
- Proper way to compute speedup:
  » Incorrect way to measure: compare parallel program on 1 processor to parallel program on p processors
  » Instead: should compare uniprocessor program on 1 processor to parallel program on p processors
(Lec 2.2)

History
- Parallel architectures tied closely to programming models
- Divergent architectures, with no predictable pattern of growth
- Mid-80s renaissance
(Lec 2.3)

Plan for Today
- Look at major programming models: where did they come from? The 80s architectural renaissance! What do they provide? How have they converged?
- Extract general structure and fundamental issues
- (Figure: Application Software and System Software layered over Architecture; Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory converging toward a Generic Architecture)
(Lec 2.4)
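The speedup rule above can be made concrete with a toy calculation; the `speedup` helper and all timings here are hypothetical, chosen only to show how the wrong baseline inflates the number:

```python
def speedup(t_best_serial, t_parallel_p):
    """Proper speedup: best uniprocessor program time / parallel time on p CPUs."""
    return t_best_serial / t_parallel_p

# Hypothetical timings (seconds)
t_serial = 10.0   # best sequential program on 1 processor
t_par_1 = 12.0    # parallel program run on 1 processor (carries parallel overhead)
t_par_8 = 2.0     # parallel program on 8 processors

wrong = t_par_1 / t_par_8            # wrong baseline hides the overhead
right = speedup(t_serial, t_par_8)   # correct baseline
print(wrong, right)                  # 6.0 5.0
```

The wrong baseline credits the parallel version for overhead it introduced itself, which is exactly why the slide insists on the uniprocessor program as the reference.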
Programming Model
- Conceptualization of the machine that the programmer uses in coding applications: how parts cooperate and coordinate their activities
- Specifies communication and synchronization operations
- Multiprogramming: no communication or synch. at program level
- Shared address space: like a bulletin board
- Message passing: like letters or phone calls, explicit point to point
- Data parallel: more regimented, global actions on data; implemented with shared address space or message passing
(Lec 2.5)

Shared Memory
- (Figure: processors connected to a shared address space)
- Range of addresses shared by all processors
- All communication is implicit (through memory); want to communicate a bunch of info? Pass a pointer.
- Programming is straightforward: generalization of multithreaded programming
(Lec 2.6)

Historical Development
- "Mainframe" approach: motivated by multiprogramming; extends crossbar used for Mem and I/O; processor cost-limited => crossbar; bandwidth scales with p; high incremental cost » use multistage instead
- "Minicomputer" approach: almost all microprocessor systems have a bus; motivated by multiprogramming, TP; used heavily for parallel computing; called symmetric multiprocessor (SMP); latency larger than for uniprocessor; bus is bandwidth bottleneck » caching is key: coherence problem; low incremental cost
(Lec 2.7)

Adding Processing Capacity
- Memory capacity increased by adding modules, I/O by controllers and devices
- Add processors for processing! For higher-throughput multiprogramming, or parallel programs
(Lec 2.8)
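The bulletin-board nature of the shared-memory model can be sketched with ordinary threads (a toy, not any particular machine's interface): every thread loads and stores into the same `shared` list, so no explicit communication operation ever appears in the code.

```python
import threading

shared = [0] * 4          # shared address space: all threads see the same list

def worker(i):
    shared[i] = i * i     # "communication" is just a store to shared memory

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)  # [0, 1, 4, 9]
```

Passing a pointer to `shared` (here, just naming the variable) is enough to hand another thread a whole data structure, which is the "want to communicate a bunch of info? pass a pointer" point from the slide.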
Shared Physical Memory
- Any processor can directly reference any location: communication operation is load/store
- Special operations for synchronization
- Any I/O controller - any memory
- Operating system can run on any processor, or all: OS uses shared memory to coordinate
- What about application processes?
(Lec 2.9)

Shared Virtual Address Space
- Process = address space plus thread of control
- Virtual-to-physical mapping can be established so that processes share portions of address space: user-kernel, or multiple processes
- Multiple threads of control on one address space: popular approach to structuring OSs; now standard application capability (ex: POSIX threads)
- Writes to shared address visible to other threads
- Natural extension of uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization » also load/stores
(Lec 2.10)

Structured Shared Address Space
- (Figure: virtual address spaces of processes P0..Pn mapped onto the machine physical address space; common physical addresses form the shared portion, per-process private regions the private portion)
- Virtual address spaces for a collection of processes communicating via shared addresses
- Ad hoc parallelism used in system code
- Most parallel applications have structured SAS
- Same program on each processor: shared variable X means the same thing to each thread
(Lec 2.11)

Cache Coherence Problem
- (Figure: two processors with caches loading and storing the same location; write-through?)
- Caches are aliases for memory locations
- Does every processor eventually see the new value? Tightly related: cache consistency » in what order do writes appear to other processors?
- Buses make this easy: every processor can snoop on every write
- Essential feature: broadcast
(Lec 2.12)
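The need for the "special atomic operations" mentioned above can be sketched in a few lines (the `add` helper and thread counts are invented for illustration): without the lock, the load-add-store sequence hidden inside `counter += 1` can interleave between threads and lose updates; with the lock, the read-modify-write is atomic and the result is deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:          # atomic read-modify-write section
            counter += 1

threads = [threading.Thread(target=add, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

Real machines provide the same guarantee in hardware (e.g. test&set or compare&swap style instructions) rather than through an OS-level lock object.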
Engineering: Intel Pentium Pro Quad
- (Figure: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2 cache, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); PCI bridges and PCI I/O cards; memory controller and MIU; 1-, 2-, or 4-way interleaved DRAM)
- All coherence and multiprocessing glue in processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth
(Lec 2.13)

Engineering: SUN Enterprise
- (Figure: CPU/mem cards and I/O cards on the Gigaplane bus (256 data, 41 address, 83 MHz); bus interface/switch; SBUS slots; 100bT, SCSI; 2 FiberChannel cards; memory controller with 2-way interleaved memory)
- Proc + mem card - I/O card: 16 cards of either type
- All memory accessed over bus, so symmetric
- Higher bandwidth, higher latency bus
(Lec 2.14)

Quad-Processor Xeon Architecture
- All sharing through pairs of front side busses (FSB)
- Memory traffic/cache misses through single chipset to memory
- Example: Blackford chipset
(Lec 2.15)

Scaling Up
- (Figure: "Omega" and "General" interconnects; "dance hall" vs. distributed-memory organizations)
- Problem is interconnect: cost (crossbar) or bandwidth (bus)
- Dance-hall: bandwidth still scalable, but lower cost than crossbar » latencies to memory uniform, but uniformly large
- Distributed memory or non-uniform memory access (NUMA) » construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
- Caching shared (particularly nonlocal) data?
(Lec 2.16)
Stanford DASH
- Clusters of 4 processors share 2nd-level cache; up to 16 clusters tied together with a 2-dim mesh
- 16-bit directory associated with every memory line: each memory line has a home cluster that contains the DRAM; the 16-bit vector says which clusters (if any) have read copies; only one writer permitted at a time
- Never got more than 12 clusters (48 processors) working at one time: synchronous network probs!
- (Figure: processors with L1 caches sharing an L2 cache)
(Lec 2.17)

The MIT Alewife Multiprocessor
- Cache-coherent shared memory, partially in software! Limited directory + software overflow
- User-level message-passing
- Rapid context-switching
- 2-dimensional synchronous network
- One node/board
- Got 32 processors (+ I/O boards) working
(Lec 2.18)

Engineering: Cray T3E
- (Figure: node with external memory, mem ctrl and NI, X/Y/Z switch)
- Scale up to 1024 processors, 480 MB/s links
- Memory controller generates request message for non-local references
- No hardware mechanism for coherence » SGI Origin etc. provide this
(Lec 2.19)

AMD Direct Connect
- Communication over general interconnect: shared memory/address space traffic over network; I/O traffic to memory over network
- Multiple topology options (seems to scale to 8 or 16 processor chips)
(Lec 2.20)
What is underlying Shared Memory?
- (Figure: Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory around a Generic Architecture)
- Packet-switched networks better utilize available link bandwidth than circuit-switched networks
- So, network passes messages around!
(Lec 2.21)

Message Passing Architectures
- Complete computer as building block, including I/O: communication via explicit I/O operations
- Programming model: direct access only to private address space (local memory); communication via explicit messages (send/receive)
- High-level block diagram: communication integration? » Mem, I/O, LAN, Cluster; easier to build and scale than SAS
- Programming model more removed from basic hardware operations: library or OS intervention
(Lec 2.22)

Message-Passing Abstraction
- (Figure: Process P executes "Send X, Q, t"; Process Q executes "Receive Y, P, t"; Address X in P's local address space is matched to Address Y in Q's)
- Send specifies buffer to be transmitted and receiving process
- Recv specifies sending process and application storage to receive into
- Memory to memory copy, but need to name processes
- Optional tag on send and matching rule on receive
- User process names local data and entities in process/tag space too
- In simplest form, the send/recv match achieves pairwise synch event » other variants too
- Many overheads: copying, buffer management, protection
(Lec 2.23)

Evolution of Message-Passing Machines
- Early machines: FIFO on each link; HW close to prog. model; synchronous ops; topology central (hypercube algorithms)
- CalTech Cosmic Cube (Seitz, CACM Jan 85)
(Lec 2.24)
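The send/receive abstraction above can be sketched with ordinary queues; `send`, `recv`, and the `mailbox` table are invented names for this toy, and the matching rule shown (discard anything with the wrong tag) is the crudest possible one. The blocking `get` is what provides the pairwise synchronization the slide mentions.

```python
import queue
import threading

mailbox = {}                         # one queue per named process (toy message layer)

def send(dst, tag, data):
    mailbox[dst].put((tag, data))    # copy of data plus its tag into dst's queue

def recv(me, tag):
    # Block until a message with a matching tag arrives; others are discarded
    while True:
        t, data = mailbox[me].get()
        if t == tag:
            return data

mailbox['P'] = queue.Queue()
mailbox['Q'] = queue.Queue()

def proc_p():
    send('Q', tag=7, data=[1, 2, 3])           # "Send X, Q, t"

def proc_q(out):
    out.append(recv('Q', tag=7))               # "Receive Y, P, t"

result = []
tp = threading.Thread(target=proc_p)
tq = threading.Thread(target=proc_q, args=(result,))
tp.start(); tq.start(); tp.join(); tq.join()
print(result)  # [[1, 2, 3]]
```

Note that the data crossed between address spaces by copy, not by sharing a pointer, which is the key contrast with the shared-memory model earlier in the lecture.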
MIT J-Machine (Jelly-bean machine)
- 3-dimensional network topology: non-adaptive, e-cube routing; hardware routing; maximize density of communication
- 64 nodes/board, 1024 nodes total
- Low-powered processors: message passing instructions; associative array primitives to aid in synthesizing shared-address space
- Extremely fine-grained communication: hardware-supported Active Messages
(Lec 2.25)

Diminishing Role of Topology?
- Shift to general links: DMA, enabling non-blocking ops » buffered by system at destination until recv; store & forward routing
- Fault-tolerant, multi-path routing
- Diminishing role of topology: any-to-any pipelined routing; node-network interface dominates communication time » network fast relative to overhead » will this change for ManyCore?; simplifies programming; allows richer design space » grids vs. hypercubes
- Intel iPSC/1 -> iPSC/2 -> iPSC/860
(Lec 2.26)

Example: Intel Paragon
- Sandia's Intel Paragon XP/S-based supercomputer
- (Figure: Intel Paragon node with two i860 processors and L1 caches on a 64-bit, 50 MHz memory bus; mem ctrl, driver, DMA, NI; 4-way interleaved DRAM; 2D grid network with a processing node attached to every switch; 8 bits, 175 MHz, bidirectional links)
(Lec 2.27)

Building on the mainstream: IBM SP-2
- Made out of essentially complete RS6000 workstations
- Network interface integrated in I/O bus (bw limited by bus)
- (Figure: IBM SP-2 node with Power 2 CPU, L2, memory bus, 4-way interleaved DRAM memory controller; NIC with i860, DMA, and DRAM on the MicroChannel bus; general interconnection network formed from 8-port switches)
(Lec 2.28)
Berkeley NOW
- 100 Sun Ultra2 workstations
- Intelligent network interface: proc + mem
- Myrinet network: 160 MB/s per link; 300 ns per hop
(Lec 2.29)

Data Parallel Systems
- Programming model: operations performed in parallel on each element of data structure; logically single thread of control, performs sequential or parallel steps; conceptually, a processor associated with each data element
- Architectural model: array of many simple, cheap processors with little memory each » processors don't sequence through instructions; attached to a control processor that issues instructions; specialized and general communication, cheap global synchronization
- Original motivations: matches simple differential equation solvers; centralize high cost of instruction fetch/sequencing
- (Figure: control processor driving a grid of PEs)
(Lec 2.30)

Application of Data Parallelism
- Each PE contains an employee record with his/her salary: If salary > 100K then salary = salary * 1.05 else salary = salary * 1.10
- Logically, the whole operation is a single step; some processors enabled for arithmetic operation, others disabled
- Other examples: finite differences, linear algebra, ...; document searching, graphics, image processing, ...
- Some recent machines: Thinking Machines CM-1, CM-2 (and CM-5); Maspar MP-1 and MP-2
(Lec 2.31)

Connection Machine
- (Tucker, IEEE Computer, Aug. 1988)
(Lec 2.32)
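The salary example above can be written in exactly the data-parallel style the slide describes: every "PE" executes the same step, with a mask enabling some elements and disabling the rest. The sample salaries are invented for illustration.

```python
salaries = [80_000, 120_000, 95_000, 150_000]

# Conceptually one PE per element; all PEs execute the same single step,
# with some enabled (salary > 100K) and the others disabled.
mask = [s > 100_000 for s in salaries]
salaries = [s * 1.05 if m else s * 1.10 for s, m in zip(salaries, mask)]

print([round(s) for s in salaries])  # [88000, 126000, 104500, 157500]
```

On a real SIMD machine the mask would be a per-PE enable bit set by the control processor, not a Python list, but the logical structure (one broadcast instruction, conditional participation) is the same.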
NVidia Tesla Architecture
- Combined GPU and general CPU
(Lec 2.33)

Components of NVidia Tesla architecture
- SM has 8 SP thread processor cores: 32 GFLOPS peak at 1.35 GHz; IEEE floating point; 32-bit, 64-bit integer; 2 SFU special function units
- Scalar ISA: memory load/store/atomic; texture fetch; branch, call, return; barrier synchronization instruction
- Multithreaded Instruction Unit: 768 independent threads per SM; HW multithreading & scheduling
- 16KB Shared Memory: concurrent threads share data; low latency load/store
- Full GPU: total performance > 500 GOps
(Lec 2.34)

Evolution and Convergence
- SIMD popular when cost savings of centralized sequencer high: 60s when CPU was a cabinet; replaced by vectors in mid-70s » more flexible w.r.t. memory layout and easier to manage; revived in mid-80s when 32-bit datapath slices just fit on chip
- Simple, regular applications have good locality
- Programming model converges with SPMD (single program multiple data): need fast global synchronization; structured global address space, implemented with either SAS or MP
(Lec 2.35)

CM-5
- Repackaged SparcStation: 4 per board
- Fat-Tree network
- Control network for global synchronization
(Lec 2.36)
(Figure: roadmap of Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory around a Generic Architecture)

Dataflow Architectures
- Represent computation as a graph of essential dependences
- Logical processor at each node, activated by availability of operands
- Message (tokens) carrying tag of next instruction sent to next processor
- Tag compared with others in matching store; match fires execution
- Example dataflow graph: a = (b + 1) * (b - c); d = c * e; f = a * d
(Lec 2.37)

(Figure: token queue, waiting-matching store, program store, instruction fetch, execute, form token; Monsoon (MIT))
1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 (Lec 2.38)

Evolution and Convergence
- Key characteristics: ability to name operations, synchronization, dynamic scheduling
- Problems: operations have locality across them, useful to group together; handling complex data structures like arrays; complexity of matching store and memory units; expose too much parallelism (?)
- Converged to use conventional processors and memory: support for large, dynamic set of threads to map to processors; typically shared address space as well; but separation of progr. model from hardware (like data-parallel)
- Lasting contributions: integration of communication with thread (handler) generation; tightly integrated communication and fine-grained synchronization; remained useful concept for software (compilers etc.)
(Lec 2.39)

Systolic Architectures
- VLSI enables inexpensive special-purpose chips: represent algorithms directly by chips connected in regular pattern; replace single processor with array of regular processing elements
- Orchestrate data flow for high throughput with less memory access
- Different from pipelining: nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
- SIMD? : each PE may do something different
(Lec 2.40)
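The dataflow firing rule described above (a node fires as soon as all of its operand tokens are present) can be sketched for the example graph a = (b+1)*(b-c), d = c*e, f = a*d; the node names `t1`/`t2` and the input values are invented for this toy interpreter:

```python
import operator

nodes = {                     # node name: (operation, input token names)
    't1': (operator.add, ['b', 'one']),   # b + 1
    't2': (operator.sub, ['b', 'c']),     # b - c
    'a':  (operator.mul, ['t1', 't2']),   # a = (b+1) * (b-c)
    'd':  (operator.mul, ['c', 'e']),     # d = c * e
    'f':  (operator.mul, ['a', 'd']),     # f = a * d
}
tokens = {'b': 3, 'c': 1, 'e': 5, 'one': 1}   # initial operand tokens

fired = True
while fired:                  # keep firing any node whose operands have arrived
    fired = False
    for name, (op, ins) in nodes.items():
        if name not in tokens and all(i in tokens for i in ins):
            tokens[name] = op(tokens[ins[0]], tokens[ins[1]])
            fired = True

print(tokens['f'])  # a = 4*2 = 8, d = 5, so f = 40
```

Note that `a` and `d` become fireable simultaneously, which is exactly the parallelism the dataflow graph exposes; a real machine's matching store does this token bookkeeping in hardware.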
Systolic Arrays (contd.)
- Example: systolic array for 1-D convolution: y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
- (Figure: x values stream through cells holding weights w1..w4; each cell computes: xout = x; x = xin; yout = yin + w*xin)
- Practical realizations (e.g. iWARP) use quite general processors » enable variety of algorithms on same hardware
- But dedicated interconnect channels » data transfer directly from register to register across channel
- Specialized, and same problems as SIMD » general purpose systems work well for same algorithms (locality etc.)
(Lec 2.41)

Toward Architectural Convergence
- Evolution and role of software have blurred boundary: send/recv supported on SAS machines via buffers; can construct global address space on MP (GA -> P | LA); page-based (or finer-grained) shared virtual memory
- Hardware organization converging too: tighter NI integration even for MP (low-latency, high-bandwidth); hardware SAS passes messages
- Even clusters of workstations/SMPs are parallel systems: emergence of fast system area networks (SAN)
- Programming models distinct, but organizations converging: nodes connected by general network and communication assists; implementations also converging, at least in high-end machines
(Lec 2.42)

Convergence: Generic Parallel Architecture
- (Figure: nodes of processor(s), memory, and communication assist (CA), connected by a scalable network)
- Node: processor(s), memory system, plus communication assist: network interface and communication controller
- Convergence allows lots of innovation, within framework: integration of assist with node, what operations, how efficiently...

Flynn's Taxonomy
- # instruction streams x # data streams: Single Instruction Single Data (SISD); Single Instruction Multiple Data (SIMD); Multiple Instruction Single Data (MISD); Multiple Instruction Multiple Data (MIMD)
- Everything is MIMD! However, the question is one of efficiency: how easily (and at what power!) can you do certain operations? The GPU solution from NVIDIA is good at graphics: is it good in general?
- Important: communication architecture: how do processors communicate with one another? How does the programmer build correct programs?
(Lec 2.43) (Lec 2.44)
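The 1-D convolution the systolic array above computes can be checked against a direct sequential implementation; the weights and input stream here are illustrative values, not from the slide:

```python
# Direct check of the systolic array's computation:
# y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
def convolve(w, x):
    taps = len(w)
    return [sum(w[k] * x[i + k] for k in range(taps))
            for i in range(len(x) - taps + 1)]

w = [1, 2, 3, 4]            # weights held in the four systolic cells
x = [1, 0, 2, 0, 1, 3]      # input stream fed through the array
print(convolve(w, x))       # [7, 8, 17]
```

The systolic version produces the same values but with each x entering the array once and moving register-to-register between cells, so the memory is read len(x) times instead of taps*len(y) times; that reduction in memory traffic is the whole point of the structure.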
Any hope for us to do research in multiprocessing?
Yes: FPGAs as New Research Platform
- ~25 CPUs can fit in a Field Programmable Gate Array (FPGA); 1000-CPU system from ~40 FPGAs?
- 64-bit simple "soft core" RISC at 100 MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs; 2X CPUs, 2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box, Massively Parallel Processor that runs standard binaries of OS, apps
- Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, ... (IBM, Sun have donated processors)
- E.g., 1000-processor, IBM Power binary-compatible, cache-coherent system at 200 MHz; fast enough for research
(Lec 2.45)

RAMP
- Since goal is to ramp up research in multiprocessing, called Research Accelerator for Multiple Processors
- To learn more, read "RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform", Technical Report UCB//CSD, Sept 2005
- Web page: ramp.eecs.berkeley.edu
- Project opportunities? Many: infrastructure development for research; validation against simulators/real systems; development of new communication features; etc.
(Lec 2.46)

Why RAMP Good for Research?

                       SMP                    Cluster                Simulate                RAMP
Cost (1000 CPUs)       F (40)                 C (2)                  A+ (0)                  A (0.1)
Cost of ownership      A                      D                      A                       A
Scalability            C                      A                      A                       A
Power/space            D (120 kw, 12 racks)   D (120 kw, 12 racks)   A+ (.1 kw, 0.1 racks)   A (1.5 kw, 0.3 racks)
Community              D                      A                      A                       A
Observability          D                      C                      A                       A
Reproducibility        B                      D                      A                       A
Flexibility            D                      C                      A                       A
Credibility            A+                     A+                     F                       A
Perform. (clock)       A (2 GHz)              A (3 GHz)              F (0 GHz)               C (0.2 GHz)
GPA                    C                      B-                     B                       A-
(Lec 2.47)

RAMP 1 Hardware
- Completed Dec. (14x17 inch 22-layer PCB)
- Module: FPGAs, memory, 10GigE conn.; Compact Flash
- Administration/maintenance ports: » 10/100 Enet » HDMI/DVI » USB
- ~4K/module w/o FPGAs or DRAM
- Called "BEE2" for Berkeley Emulation Engine 2
(Lec 2.48)
RAMP Blue Prototype (1/07)
- 8 MicroBlaze cores / FPGA
- 8 BEE2 modules (32 "user" FPGAs) x 4 FPGAs/module = ... MHz
- Full star-connection between modules
- It works; runs NAS benchmarks
- CPUs are softcore MicroBlazes (32-bit Xilinx RISC architecture)
(Lec 2.49)

Vision: Multiprocessing Watering Hole
- (Figure: RAMP surrounded by: parallel file system, dataflow language/computer, data center in a box, thread scheduling, security enhancements, internet in a box, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages)
- RAMP attracts many communities to shared artifact: cross-disciplinary interactions; accelerate innovation in multiprocessing
- RAMP as next Standard Research Platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)
(Lec 2.50)

Conclusion
- Several major types of communication: Shared Memory, Message Passing, Data-Parallel, Systolic, Dataflow
- Is communication "Turing-complete"? Can simulate each of these on top of the other!
- Many tradeoffs in hardware support
- Communication is a first-class citizen!
  » How to perform communication is essential: IS IT IMPLICIT or EXPLICIT?
  » What to do with communication errors?
  » Does locality matter???
  » How to synchronize?
(Lec 2.51)
Learning Curve for Parallel Applications. 500 Fastest Computers
Learning Curve for arallel Applications ABER molecular dynamics simulation program Starting point was vector code for Cray-1 145 FLO on Cray90, 406 for final version on 128-processor aragon, 891 on 128-processor
More informationEvolution and Convergence of Parallel Architectures
History Evolution and Convergence of arallel Architectures Historically, parallel architectures tied to programming models Divergent architectures, with no predictable pattern of growth. Todd C. owry CS
More informationECE 669 Parallel Computer Architecture
ECE 669 arallel Computer Architecture Lecture 2 Architectural erspective Overview Increasingly attractive Economics, technology, architecture, application demand Increasingly central and mainstream arallelism
More informationParallel Programming Models and Architecture
Parallel Programming Models and Architecture CS 740 September 18, 2013 Seth Goldstein Carnegie Mellon University History Historically, parallel architectures tied to programming models Divergent architectures,
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationNumber of processing elements (PEs). Computing power of each element. Amount of physical memory used. Data access, Communication and Synchronization
Parallel Computer Architecture A parallel computer is a collection of processing elements that cooperate to solve large problems fast Broad issues involved: Resource Allocation: Number of processing elements
More informationThree parallel-programming models
Three parallel-programming models Shared-memory programming is like using a bulletin board where you can communicate with colleagues. essage-passing is like communicating via e-mail or telephone calls.
More informationCPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner
CS104 Computer Organization and rogramming Lecture 20: Superscalar processors, Multiprocessors Robert Wagner Faster and faster rocessors So much to do, so little time... How can we make computers that
More informationParallel Architecture Fundamentals
arallel Architecture Fundamentals Topics CS 740 September 22, 2003 What is arallel Architecture? Why arallel Architecture? Evolution and Convergence of arallel Architectures Fundamental Design Issues What
More informationNOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability
Recap: Gigaplane Bus Timing 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Address Rd A Rd B Scalability State Arbitration 1 4,5 2 Share ~Own 6 Own 7 A D A D A D A D A D A D A D A D CS 258, Spring 99 David E. Culler
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationConvergence of Parallel Architectures
History Historically, parallel architectures tied to programming models Divergent architectures, with no predictable pattern of growth. Systolic Arrays Dataflow Application Software System Software Architecture
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationCS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Introduction (Chapter 1)
CS/ECE 757: Advanced Computer Architecture II (arallel Computer Architecture) Introduction (Chapter 1) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived from work by Sarita
More informationConventional Computer Architecture. Abstraction
Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction
More informationScalable Multiprocessors
arallel Computer Organization and Design : Lecture 7 er Stenström. 2008, Sally A. ckee 2009 Scalable ultiprocessors What is a scalable design? (7.1) Realizing programming models (7.2) Scalable communication
More informationLimitations of Memory System Performance
Slides taken from arallel Computing latforms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar! " To accompany the text ``Introduction to arallel Computing'', Addison Wesley, 2003. Limitations
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationLecture 28 Introduction to Parallel Processing and some Architectural Ramifications. Flynn s Taxonomy. Multiprocessing.
1 2 Lecture 28 Introduction to arallel rocessing and some Architectural Ramifications 3 4 ultiprocessing Flynn s Taxonomy Flynn s Taxonomy of arallel achines How many Instruction streams? How many Data
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationChapter 2: Computer-System Structures. Hmm this looks like a Computer System?
Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationUniprocessor Computer Architecture Example: Cray T3E
Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationDr. Joe Zhang PDC-3: Parallel Platforms
CSC630/CSC730: arallel & Distributed Computing arallel Computing latforms Chapter 2 (2.3) 1 Content Communication models of Logical organization (a programmer s view) Control structure Communication model
More informationLecture 17: Parallel Architectures and Future Computer Architectures. Shared-Memory Multiprocessors
Lecture 17: arallel Architectures and Future Computer Architectures rof. Kunle Olukotun EE 282h Fall 98/99 1 Shared-emory ultiprocessors Several processors share one address space» conceptually a shared
More informationSpring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University
18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?
More informationProcessor Architecture and Interconnect
Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationParallel Computer Architecture
arallel Computer Architecture CS 472 Concurrent & arallel rogramming University of Evansville Selection of slides from CIS 410/510 Introduction to arallel Computing Department of Computer and Information
More informationHistory of Distributed Systems. Joseph Cordina
History of Distributed Systems Joseph Cordina joseph.cordina@um.edu.mt otivation Computation demands were always higher than technological status quo Obvious answer Several computing elements working in
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationPlatforms Design Challenges with many cores
latforms Design hallenges with many cores Raj Yavatkar, Intel Fellow Director, Systems Technology Lab orporate Technology Group 1 Environmental Trends: ell 2 *Other names and brands may be claimed as the
More informationEE382 Processor Design. Illinois
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors Part II EE 382 Processor Design Winter 98/99 Michael Flynn 1 Illinois EE 382 Processor Design Winter 98/99 Michael Flynn 2 1 Write-invalidate
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationCMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3
MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance
SMP and ccNUMA Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5, September 14, 2007. SMP and ccNUMA Multiprocessor Systems. Professor Kai Hwang, USC Internet and Grid Computing Laboratory. Email: kaihwang@usc.edu [1]
Multiprocessor Interconnection Networks
Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical
Scalable Distributed Memory Machines
Scalable Distributed Memory Machines Goal: Parallel machines that can be scaled to hundreds or thousands of processors. Design Choices: Custom-designed or commodity nodes? Network scalability. Capability
PARALLEL COMPUTER ARCHITECTURES
8 PARALLEL COMPUTER ARCHITECTURES. CPU, Shared memory. Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different
Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions:
Scalable Cache Coherent Systems. Scalable distributed shared memory machines. Assumptions: Processor-cache-memory nodes connected by scalable network. Distributed shared physical address space. Communication assist
Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors
Multiprocessors and Multithreading: why would you want a multiprocessor? More is better? Classifying Multiprocessors: Flynn Taxonomy. Interconnection Network
NOW Handout Page 1. Recap: Performance Trade-offs. Shared Memory Multiprocessors. Uniprocessor View. Recap (cont) What is a Multiprocessor?
Recap: Performance Trade-offs. Shared Memory Multiprocessors, CS 258, Spring 99, David E. Culler, Computer Science Division, U.C. Berkeley. Programmer's View of Performance: Speedup < Sequential Work / Max (Work + Synch
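The truncated inequality in this preview is, in the Culler/Singh formulation, Speedup(p) ≤ Sequential Work / max over processors of (Work + Synch Wait + Comm Cost): the slowest processor bounds parallel execution time. A minimal sketch of evaluating that bound; the per-processor cost numbers are made up for illustration:

```python
# Sketch of the speedup upper bound from the preview's formula.
# The workload numbers below are hypothetical, not from the source.

def speedup_bound(sequential_work, per_proc_costs):
    """per_proc_costs: (work + synch wait + comm cost) for each processor.
    Parallel time is set by the most heavily loaded processor."""
    return sequential_work / max(per_proc_costs)

# 100 units of sequential work split over 4 processors, unevenly,
# with synchronization/communication overhead folded into each cost.
costs = [30, 28, 27, 32]
print(speedup_bound(100, costs))  # bounded by the 32-unit processor: 3.125
```

Note how load imbalance alone (32 vs. 25 units in a perfect split) already keeps the bound well below the processor count of 4.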
Multiprocessors - Flynn's Taxonomy (1966)
Multiprocessors - Flynn's Taxonomy (1966): Single Instruction stream, Single Data stream (SISD). Conventional uniprocessor, although ILP is exploited. Single Program Counter -> Single Instruction stream. The
Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn's Taxonomy (1972). Why Multiprocessors?
Parallel Computers. CPE 631 Session 20: Multiprocessors. Department of Electrical and Computer Engineering, University of Alabama in Huntsville. Definition: A parallel computer is a collection of processing
Outline. Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
A Scalable SAS Machine
Parallel Computer Organization and Design: Lecture 8. Per Stenström 2008, Sally A. McKee 2009. Scalable Cache Coherence: design principles of scalable cache protocols. Overview of design space (8.1). Basic operation
Handout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
Computing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures, Part 2. TMA4280 Introduction to Supercomputing. NTNU, IMF, January 16, 2017. Supercomputing: What is the motivation for Supercomputing? Solve complex problems fast and accurately:
Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures
Approaches to Building Parallel Machines: Switch/Bus, Scale. Shared Memory Architectures: (interleaved) first-level cache, (interleaved) main memory. Arvind Krishnamurthy, Fall 2004. Shared Cache
Cache Coherence in Scalable Machines
Cache Coherence in Scalable Machines. CSE 661 Parallel and Vector Architectures. Prof. Muhamed Mudawar, Computer Engineering Department, King Fahd University of Petroleum and Minerals. Generic Scalable Multiprocessor
Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture. Parallel Computing. Topics: 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl's Law 5. Flynn's Taxonomy of Parallel Computers 6.
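The Amdahl's Law item in the topic list can be made concrete with a short sketch; the 90%-parallel fraction and processor counts below are assumed example values, not taken from any of these course materials:

```python
# Sketch of Amdahl's Law: overall speedup is limited by the serial fraction.
# The 0.9 parallel fraction is an illustrative assumption.

def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup = 1 / (serial + parallel/n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# With 16 processors: 1 / (0.1 + 0.9/16) = 6.4
print(round(amdahl_speedup(0.9, 16), 2))
# Even with an enormous processor count, a 10% serial fraction
# caps speedup near 10x.
print(round(amdahl_speedup(0.9, 10**9), 2))
```

This is the standard argument for why the serial fraction, not the processor count, dominates at scale.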
Interconnection Network
Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Topics
Interconnection Network. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Topics Taxonomy Metric Topologies Characteristics Cost Performance 2 Interconnection
CCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM
CCS HPC. taisuke@cs.tsukuba.ac.jp. CPU, memory, I/O; single-chip multi-core CPU. PC, MPP (Massively Parallel Processor), e.g. IBM BlueGene/L (65536 nodes). Interconnection Network. (distributed memory system) (shared
Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
Chapter 1: Perspectives
Chapter 1: Perspectives. Copyright © 2005-2008 Yan Solihin. Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical,
What is a parallel computer?
7.5 credit points. Instructor: Sally A. McKee. [Figure: IBM SP-2 node: Power 2 CPU, L2 cache, memory bus, 4-way interleaved DRAM with memory controller, MicroChannel; nodes joined by a general interconnection network formed from 8-port switches.]
Intro to Multiprocessors
The Big Picture: Where are We Now? Intro to Multiprocessors. [Figure: processor datapaths with inputs and outputs; adapted from Computer Organization and Design, Patterson & Hennessy, 2005.] Multiprocessor: multiple
[7.2.5] Certain challenges arise in realizing SAS or message-passing programming models. Two of these are input-buffer overflow and fetch deadlock.
Buffering Problems [7.2.5]: Certain challenges arise in realizing SAS or message-passing programming models. Two of these are input-buffer overflow and fetch deadlock. Input-buffer overflow: Suppose a large
Parallel Architectures
Parallel Architectures. Instructor: Tsung-Che Chiang, tcchiang@ieee.org, Department of Computer Science and Information Engineering, National Taiwan Normal University. Introduction: In the roughly three decades between
CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks, March 9th, 2011
CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
A non-uniform memory access machine (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
NOW Handout Page 1. *T: Network Co-Processor. General Purpose Node-to-Network Interface in Scalable Multiprocessors. Dedicated Message Processor
*T: Network Co-Processor. General Purpose Node-to-Network Interface in Scalable Multiprocessors. CS 258, Spring 99, David E. Culler, Computer Science Division, U.C. Berkeley. iWarp: Systolic Computation. Host Interface unit. Dedicated
ECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 27 Course Wrap Up What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448
Multiprocessors and Thread Level Parallelism, Chapter 4, Appendix H, CS448. The Greed for Speed: Two general approaches to making computers faster. Faster uniprocessor: all the techniques we've been looking
Parallel Computer Architecture
Parallel Computer Architecture What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some broad issues: Resource Allocation:»
Computer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
Lecture 1: Introduction
Lecture 1: Introduction. Course organization: 13 lectures on parallel architectures; ~5 lectures on cache coherence, consistency; ~3 lectures on TM; ~2 lectures on interconnection networks; ~2 lectures on large
Parallel Architecture. Hwansoo Han
Parallel Architecture. Hwansoo Han. Performance Curve. Unicore Limitations: performance scaling stopped due to power, wire delay, DRAM latency, limitation in ILP. Power Consumption (watts). Wire Delay Range
Interconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
The Sun Fireplane Interconnect in the Mid-Range Sun Fire Servers
TAKE IT TO THE NTH. Alan Charlesworth, Sun Microsystems. The Sun Fireplane Interconnect in the Mid-Range Sun Fire Servers. Vertical & Horizontal Scaling: many CPUs in one box. Cache-coherent shared memory (SMP). Usually proprietary
EE382 Processor Design. Processor Issues for MP
EE382 Processor Design, Winter 1998. Chapter 8 Lectures: Multiprocessors, Part I. EE 382 Processor Design Winter 98/99, Michael Flynn. Processor Issues for MP: Initialization, Interrupts, Virtual Memory, TLB Coherency
Outline. Limited Scaling of a Bus
Outline: Scalability (physical, bandwidth, latency and cost; level of integration). Realizing Programming Models: network transactions, protocols, safety; input buffer problem: N-1; fetch deadlock. Communication Architecture
COSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I). Edgar Gabriel, Spring 2014. Long-term trend on the number of transistors per integrated circuit: the number of transistors doubles every ~18 months
10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems. To enhance system performance and, in some cases, to increase
Scalable Multiprocessors
Scalable Multiprocessors [11.1]: A scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. As the size of the system
Computer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
Message Passing Models and Multicomputer distributed system LECTURE 7
Message Passing Models and Multicomputer Distributed Systems, Lecture 7. Dr. Samman H. Ameen. [Figure: nodes connected by a message-passing direct network interconnection.]
WHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
Parallel Arch. Review
Parallel Arch. Review Zeljko Zilic McConnell Engineering Building Room 536 Main Points Understanding of the design and engineering of modern parallel computers Technology forces Fundamental architectural
CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review
CSE502 Graduate Computer Architecture Lec 22 Goodbye to Computer Architecture and Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from
Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism. Multithreading: Increasing performance by ILP has the great advantage that it is reasonably transparent to the programmer, but ILP can be quite limited or hard to
Lecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
Parallel Processing. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture: Parallel Processing. Prof. Dr. Nizamettin AYDIN, naydin@yildiz.edu.tr, nizamettinaydin@gmail.com. http://www.yildiz.edu.tr/~naydin. Outline: Multiple Processor
COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
COSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems. Fall 2006. Classification of Parallel Architectures: Flynn's Taxonomy. SISD: single instruction single data; classical von Neumann architecture. SIMD:
Lecture notes for CS Chapter 4 11/27/18
Chapter 5: Thread-Level Parallelism, Part 1. Introduction: What is a parallel or multiprocessor system? Why parallel architecture? Performance potential. Flynn classification. Communication models. Architectures
Cache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
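The "Coherent Memory System: Intuition" item above names the classic staleness problem: two processors cache the same location, one writes it, and the other must not keep reading the old value. A toy write-through, write-invalidate sketch; the Cache class and its API are illustrative assumptions, not taken from any of these course materials:

```python
# Toy illustration of the cache coherence problem and a write-invalidate fix.
# Everything here (class names, write-through policy) is an assumed sketch.

memory = {"x": 0}  # shared main memory

class Cache:
    def __init__(self):
        self.lines = {}                     # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]             # hit: serve the cached copy

    def write(self, addr, value, others=()):
        self.lines[addr] = value
        memory[addr] = value                # write-through for simplicity
        for cache in others:                # write-invalidate: drop stale copies
            cache.lines.pop(addr, None)

p0, p1 = Cache(), Cache()
p0.read("x"); p1.read("x")      # both caches now hold x = 0
p0.write("x", 1, others=[p1])   # p0 writes; p1's copy is invalidated
print(p1.read("x"))             # p1 misses and refetches: prints 1
```

Without the invalidation step, p1's read would still hit its cached 0, which is exactly the incoherence the formal definition rules out.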