CS 258 Parallel Computer Architecture, Lecture 2: Convergence of Parallel Architectures
CS 258 Parallel Computer Architecture
Lecture 2: Convergence of Parallel Architectures
January 28, 2008, Prof. John D. Kubiatowicz

Review
- Industry has decided that multiprocessing is the future/best use of transistors: every major chip manufacturer now making MultiCore chips
- History of microprocessor architecture is parallelism: translates area and density into performance
- The future is higher levels of parallelism: parallel architecture concepts apply at many levels; communication also on exponential curve
- Proper way to compute speedup:
  » Incorrect way to measure: compare parallel program on 1 processor to parallel program on p processors
  » Instead: should compare uniprocessor program on 1 processor to parallel program on p processors
(Lec 2.2)

History
- Parallel architectures tied closely to programming models
- Divergent architectures, with no predictable pattern of growth
- Mid-80s renaissance
(Lec 2.3)

Plan for Today
- Look at major programming models: where did they come from? The 80s architectural renaissance! What do they provide? How have they converged?
- Extract general structure and fundamental issues
- (Figure: Application Software and System Software layered over Architecture; Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory converging toward a Generic Architecture)
(Lec 2.4)
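The speedup rule above can be made concrete with a toy calculation; the `speedup` helper and all timings here are hypothetical, chosen only to show how the wrong baseline inflates the number:

```python
def speedup(t_best_serial, t_parallel_p):
    """Proper speedup: best uniprocessor program time / parallel time on p CPUs."""
    return t_best_serial / t_parallel_p

# Hypothetical timings (seconds)
t_serial = 10.0   # best sequential program on 1 processor
t_par_1 = 12.0    # parallel program run on 1 processor (carries parallel overhead)
t_par_8 = 2.0     # parallel program on 8 processors

wrong = t_par_1 / t_par_8            # wrong baseline hides the overhead
right = speedup(t_serial, t_par_8)   # correct baseline
print(wrong, right)                  # 6.0 5.0
```

The wrong baseline credits the parallel version for overhead it introduced itself, which is exactly why the slide insists on the uniprocessor program as the reference.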
Programming Model
- Conceptualization of the machine that the programmer uses in coding applications: how parts cooperate and coordinate their activities
- Specifies communication and synchronization operations
- Multiprogramming: no communication or synch. at program level
- Shared address space: like a bulletin board
- Message passing: like letters or phone calls, explicit point to point
- Data parallel: more regimented, global actions on data; implemented with shared address space or message passing
(Lec 2.5)

Shared Memory
- (Figure: processors connected to a shared address space)
- Range of addresses shared by all processors
- All communication is implicit (through memory); want to communicate a bunch of info? Pass a pointer.
- Programming is straightforward: generalization of multithreaded programming
(Lec 2.6)

Historical Development
- "Mainframe" approach: motivated by multiprogramming; extends crossbar used for Mem and I/O; processor cost-limited => crossbar; bandwidth scales with p; high incremental cost » use multistage instead
- "Minicomputer" approach: almost all microprocessor systems have a bus; motivated by multiprogramming, TP; used heavily for parallel computing; called symmetric multiprocessor (SMP); latency larger than for uniprocessor; bus is bandwidth bottleneck » caching is key: coherence problem; low incremental cost
(Lec 2.7)

Adding Processing Capacity
- Memory capacity increased by adding modules, I/O by controllers and devices
- Add processors for processing! For higher-throughput multiprogramming, or parallel programs
(Lec 2.8)
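The bulletin-board nature of the shared-memory model can be sketched with ordinary threads (a toy, not any particular machine's interface): every thread loads and stores into the same `shared` list, so no explicit communication operation ever appears in the code.

```python
import threading

shared = [0] * 4          # shared address space: all threads see the same list

def worker(i):
    shared[i] = i * i     # "communication" is just a store to shared memory

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)  # [0, 1, 4, 9]
```

Passing a pointer to `shared` (here, just naming the variable) is enough to hand another thread a whole data structure, which is the "want to communicate a bunch of info? pass a pointer" point from the slide.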
Shared Physical Memory
- Any processor can directly reference any location: communication operation is load/store
- Special operations for synchronization
- Any I/O controller - any memory
- Operating system can run on any processor, or all: OS uses shared memory to coordinate
- What about application processes?
(Lec 2.9)

Shared Virtual Address Space
- Process = address space plus thread of control
- Virtual-to-physical mapping can be established so that processes share portions of address space: user-kernel, or multiple processes
- Multiple threads of control on one address space: popular approach to structuring OSs; now standard application capability (ex: POSIX threads)
- Writes to shared address visible to other threads
- Natural extension of uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization » also load/stores
(Lec 2.10)

Structured Shared Address Space
- (Figure: virtual address spaces of processes P0..Pn mapped onto the machine physical address space; common physical addresses form the shared portion, per-process private regions the private portion)
- Virtual address spaces for a collection of processes communicating via shared addresses
- Ad hoc parallelism used in system code
- Most parallel applications have structured SAS
- Same program on each processor: shared variable X means the same thing to each thread
(Lec 2.11)

Cache Coherence Problem
- (Figure: two processors with caches loading and storing the same location; write-through?)
- Caches are aliases for memory locations
- Does every processor eventually see the new value? Tightly related: cache consistency » in what order do writes appear to other processors?
- Buses make this easy: every processor can snoop on every write
- Essential feature: broadcast
(Lec 2.12)
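The need for the "special atomic operations" mentioned above can be sketched in a few lines (the `add` helper and thread counts are invented for illustration): without the lock, the load-add-store sequence hidden inside `counter += 1` can interleave between threads and lose updates; with the lock, the read-modify-write is atomic and the result is deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:          # atomic read-modify-write section
            counter += 1

threads = [threading.Thread(target=add, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

Real machines provide the same guarantee in hardware (e.g. test&set or compare&swap style instructions) rather than through an OS-level lock object.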
Engineering: Intel Pentium Pro Quad
- (Figure: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2 cache, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); PCI bridges and PCI I/O cards; memory controller and MIU; 1-, 2-, or 4-way interleaved DRAM)
- All coherence and multiprocessing glue in processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth
(Lec 2.13)

Engineering: SUN Enterprise
- (Figure: CPU/mem cards and I/O cards on the Gigaplane bus (256 data, 41 address, 83 MHz); bus interface/switch; SBUS slots; 100bT, SCSI; 2 FiberChannel cards; memory controller with 2-way interleaved memory)
- Proc + mem card - I/O card: 16 cards of either type
- All memory accessed over bus, so symmetric
- Higher bandwidth, higher latency bus
(Lec 2.14)

Quad-Processor Xeon Architecture
- All sharing through pairs of front side busses (FSB)
- Memory traffic/cache misses through single chipset to memory
- Example: Blackford chipset
(Lec 2.15)

Scaling Up
- (Figure: "Omega" and "General" interconnects; "dance hall" vs. distributed-memory organizations)
- Problem is interconnect: cost (crossbar) or bandwidth (bus)
- Dance-hall: bandwidth still scalable, but lower cost than crossbar » latencies to memory uniform, but uniformly large
- Distributed memory or non-uniform memory access (NUMA) » construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
- Caching shared (particularly nonlocal) data?
(Lec 2.16)
Stanford DASH
- Clusters of 4 processors share 2nd-level cache; up to 16 clusters tied together with a 2-dim mesh
- 16-bit directory associated with every memory line: each memory line has a home cluster that contains the DRAM; the 16-bit vector says which clusters (if any) have read copies; only one writer permitted at a time
- Never got more than 12 clusters (48 processors) working at one time: synchronous network probs!
- (Figure: processors with L1 caches sharing an L2 cache)
(Lec 2.17)

The MIT Alewife Multiprocessor
- Cache-coherent shared memory, partially in software! Limited directory + software overflow
- User-level message-passing
- Rapid context-switching
- 2-dimensional synchronous network
- One node/board
- Got 32 processors (+ I/O boards) working
(Lec 2.18)

Engineering: Cray T3E
- (Figure: node with external memory, mem ctrl and NI, X/Y/Z switch)
- Scale up to 1024 processors, 480 MB/s links
- Memory controller generates request message for non-local references
- No hardware mechanism for coherence » SGI Origin etc. provide this
(Lec 2.19)

AMD Direct Connect
- Communication over general interconnect: shared memory/address space traffic over network; I/O traffic to memory over network
- Multiple topology options (seems to scale to 8 or 16 processor chips)
(Lec 2.20)
What is underlying Shared Memory?
- (Figure: Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory around a Generic Architecture)
- Packet-switched networks better utilize available link bandwidth than circuit-switched networks
- So, network passes messages around!
(Lec 2.21)

Message Passing Architectures
- Complete computer as building block, including I/O: communication via explicit I/O operations
- Programming model: direct access only to private address space (local memory); communication via explicit messages (send/receive)
- High-level block diagram: communication integration? » Mem, I/O, LAN, Cluster; easier to build and scale than SAS
- Programming model more removed from basic hardware operations: library or OS intervention
(Lec 2.22)

Message-Passing Abstraction
- (Figure: Process P executes "Send X, Q, t"; Process Q executes "Receive Y, P, t"; Address X in P's local address space is matched to Address Y in Q's)
- Send specifies buffer to be transmitted and receiving process
- Recv specifies sending process and application storage to receive into
- Memory to memory copy, but need to name processes
- Optional tag on send and matching rule on receive
- User process names local data and entities in process/tag space too
- In simplest form, the send/recv match achieves pairwise synch event » other variants too
- Many overheads: copying, buffer management, protection
(Lec 2.23)

Evolution of Message-Passing Machines
- Early machines: FIFO on each link; HW close to prog. model; synchronous ops; topology central (hypercube algorithms)
- CalTech Cosmic Cube (Seitz, CACM Jan 85)
(Lec 2.24)
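The send/receive abstraction above can be sketched with ordinary queues; `send`, `recv`, and the `mailbox` table are invented names for this toy, and the matching rule shown (discard anything with the wrong tag) is the crudest possible one. The blocking `get` is what provides the pairwise synchronization the slide mentions.

```python
import queue
import threading

mailbox = {}                         # one queue per named process (toy message layer)

def send(dst, tag, data):
    mailbox[dst].put((tag, data))    # copy of data plus its tag into dst's queue

def recv(me, tag):
    # Block until a message with a matching tag arrives; others are discarded
    while True:
        t, data = mailbox[me].get()
        if t == tag:
            return data

mailbox['P'] = queue.Queue()
mailbox['Q'] = queue.Queue()

def proc_p():
    send('Q', tag=7, data=[1, 2, 3])           # "Send X, Q, t"

def proc_q(out):
    out.append(recv('Q', tag=7))               # "Receive Y, P, t"

result = []
tp = threading.Thread(target=proc_p)
tq = threading.Thread(target=proc_q, args=(result,))
tp.start(); tq.start(); tp.join(); tq.join()
print(result)  # [[1, 2, 3]]
```

Note that the data crossed between address spaces by copy, not by sharing a pointer, which is the key contrast with the shared-memory model earlier in the lecture.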
MIT J-Machine (Jelly-bean machine)
- 3-dimensional network topology: non-adaptive, e-cube routing; hardware routing; maximize density of communication
- 64 nodes/board, 1024 nodes total
- Low-powered processors: message passing instructions; associative array primitives to aid in synthesizing shared-address space
- Extremely fine-grained communication: hardware-supported Active Messages
(Lec 2.25)

Diminishing Role of Topology?
- Shift to general links: DMA, enabling non-blocking ops » buffered by system at destination until recv; store & forward routing
- Fault-tolerant, multi-path routing
- Diminishing role of topology: any-to-any pipelined routing; node-network interface dominates communication time » network fast relative to overhead » will this change for ManyCore?; simplifies programming; allows richer design space » grids vs. hypercubes
- Intel iPSC/1 -> iPSC/2 -> iPSC/860
(Lec 2.26)

Example: Intel Paragon
- Sandia's Intel Paragon XP/S-based supercomputer
- (Figure: Intel Paragon node with two i860 processors and L1 caches on a 64-bit, 50 MHz memory bus; mem ctrl, driver, DMA, NI; 4-way interleaved DRAM; 2D grid network with a processing node attached to every switch; 8 bits, 175 MHz, bidirectional links)
(Lec 2.27)

Building on the mainstream: IBM SP-2
- Made out of essentially complete RS6000 workstations
- Network interface integrated in I/O bus (bw limited by bus)
- (Figure: IBM SP-2 node with Power 2 CPU, L2, memory bus, 4-way interleaved DRAM memory controller; NIC with i860, DMA, and DRAM on the MicroChannel bus; general interconnection network formed from 8-port switches)
(Lec 2.28)
Berkeley NOW
- 100 Sun Ultra2 workstations
- Intelligent network interface: proc + mem
- Myrinet network: 160 MB/s per link; 300 ns per hop
(Lec 2.29)

Data Parallel Systems
- Programming model: operations performed in parallel on each element of data structure; logically single thread of control, performs sequential or parallel steps; conceptually, a processor associated with each data element
- Architectural model: array of many simple, cheap processors with little memory each » processors don't sequence through instructions; attached to a control processor that issues instructions; specialized and general communication, cheap global synchronization
- Original motivations: matches simple differential equation solvers; centralize high cost of instruction fetch/sequencing
- (Figure: control processor driving a grid of PEs)
(Lec 2.30)

Application of Data Parallelism
- Each PE contains an employee record with his/her salary: If salary > 100K then salary = salary * 1.05 else salary = salary * 1.10
- Logically, the whole operation is a single step; some processors enabled for arithmetic operation, others disabled
- Other examples: finite differences, linear algebra, ...; document searching, graphics, image processing, ...
- Some recent machines: Thinking Machines CM-1, CM-2 (and CM-5); Maspar MP-1 and MP-2
(Lec 2.31)

Connection Machine
- (Tucker, IEEE Computer, Aug. 1988)
(Lec 2.32)
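The salary example above can be written in exactly the data-parallel style the slide describes: every "PE" executes the same step, with a mask enabling some elements and disabling the rest. The sample salaries are invented for illustration.

```python
salaries = [80_000, 120_000, 95_000, 150_000]

# Conceptually one PE per element; all PEs execute the same single step,
# with some enabled (salary > 100K) and the others disabled.
mask = [s > 100_000 for s in salaries]
salaries = [s * 1.05 if m else s * 1.10 for s, m in zip(salaries, mask)]

print([round(s) for s in salaries])  # [88000, 126000, 104500, 157500]
```

On a real SIMD machine the mask would be a per-PE enable bit set by the control processor, not a Python list, but the logical structure (one broadcast instruction, conditional participation) is the same.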
NVidia Tesla Architecture
- Combined GPU and general CPU
(Lec 2.33)

Components of NVidia Tesla architecture
- SM has 8 SP thread processor cores: 32 GFLOPS peak at 1.35 GHz; IEEE floating point; 32-bit, 64-bit integer; 2 SFU special function units
- Scalar ISA: memory load/store/atomic; texture fetch; branch, call, return; barrier synchronization instruction
- Multithreaded Instruction Unit: 768 independent threads per SM; HW multithreading & scheduling
- 16KB Shared Memory: concurrent threads share data; low latency load/store
- Full GPU: total performance > 500 GOps
(Lec 2.34)

Evolution and Convergence
- SIMD popular when cost savings of centralized sequencer high: 60s when CPU was a cabinet; replaced by vectors in mid-70s » more flexible w.r.t. memory layout and easier to manage; revived in mid-80s when 32-bit datapath slices just fit on chip
- Simple, regular applications have good locality
- Programming model converges with SPMD (single program multiple data): need fast global synchronization; structured global address space, implemented with either SAS or MP
(Lec 2.35)

CM-5
- Repackaged SparcStation: 4 per board
- Fat-Tree network
- Control network for global synchronization
(Lec 2.36)
(Figure: roadmap of Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory around a Generic Architecture)

Dataflow Architectures
- Represent computation as a graph of essential dependences
- Logical processor at each node, activated by availability of operands
- Message (tokens) carrying tag of next instruction sent to next processor
- Tag compared with others in matching store; match fires execution
- Example dataflow graph: a = (b + 1) * (b - c); d = c * e; f = a * d
(Lec 2.37)

(Figure: token queue, waiting-matching store, program store, instruction fetch, execute, form token; Monsoon (MIT))
1/28/08 Kubiatowicz CS258 ©UCB Spring 2008 (Lec 2.38)

Evolution and Convergence
- Key characteristics: ability to name operations, synchronization, dynamic scheduling
- Problems: operations have locality across them, useful to group together; handling complex data structures like arrays; complexity of matching store and memory units; expose too much parallelism (?)
- Converged to use conventional processors and memory: support for large, dynamic set of threads to map to processors; typically shared address space as well; but separation of progr. model from hardware (like data-parallel)
- Lasting contributions: integration of communication with thread (handler) generation; tightly integrated communication and fine-grained synchronization; remained useful concept for software (compilers etc.)
(Lec 2.39)

Systolic Architectures
- VLSI enables inexpensive special-purpose chips: represent algorithms directly by chips connected in regular pattern; replace single processor with array of regular processing elements
- Orchestrate data flow for high throughput with less memory access
- Different from pipelining: nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
- SIMD? : each PE may do something different
(Lec 2.40)
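The dataflow firing rule described above (a node fires as soon as all of its operand tokens are present) can be sketched for the example graph a = (b+1)*(b-c), d = c*e, f = a*d; the node names `t1`/`t2` and the input values are invented for this toy interpreter:

```python
import operator

nodes = {                     # node name: (operation, input token names)
    't1': (operator.add, ['b', 'one']),   # b + 1
    't2': (operator.sub, ['b', 'c']),     # b - c
    'a':  (operator.mul, ['t1', 't2']),   # a = (b+1) * (b-c)
    'd':  (operator.mul, ['c', 'e']),     # d = c * e
    'f':  (operator.mul, ['a', 'd']),     # f = a * d
}
tokens = {'b': 3, 'c': 1, 'e': 5, 'one': 1}   # initial operand tokens

fired = True
while fired:                  # keep firing any node whose operands have arrived
    fired = False
    for name, (op, ins) in nodes.items():
        if name not in tokens and all(i in tokens for i in ins):
            tokens[name] = op(tokens[ins[0]], tokens[ins[1]])
            fired = True

print(tokens['f'])  # a = 4*2 = 8, d = 5, so f = 40
```

Note that `a` and `d` become fireable simultaneously, which is exactly the parallelism the dataflow graph exposes; a real machine's matching store does this token bookkeeping in hardware.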
Systolic Arrays (contd.)
- Example: systolic array for 1-D convolution: y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
- (Figure: x values stream through cells holding weights w1..w4; each cell computes: xout = x; x = xin; yout = yin + w*xin)
- Practical realizations (e.g. iWARP) use quite general processors » enable variety of algorithms on same hardware
- But dedicated interconnect channels » data transfer directly from register to register across channel
- Specialized, and same problems as SIMD » general purpose systems work well for same algorithms (locality etc.)
(Lec 2.41)

Toward Architectural Convergence
- Evolution and role of software have blurred boundary: send/recv supported on SAS machines via buffers; can construct global address space on MP (GA -> P | LA); page-based (or finer-grained) shared virtual memory
- Hardware organization converging too: tighter NI integration even for MP (low-latency, high-bandwidth); hardware SAS passes messages
- Even clusters of workstations/SMPs are parallel systems: emergence of fast system area networks (SAN)
- Programming models distinct, but organizations converging: nodes connected by general network and communication assists; implementations also converging, at least in high-end machines
(Lec 2.42)

Convergence: Generic Parallel Architecture
- (Figure: nodes of processor(s), memory, and communication assist (CA), connected by a scalable network)
- Node: processor(s), memory system, plus communication assist: network interface and communication controller
- Convergence allows lots of innovation, within framework: integration of assist with node, what operations, how efficiently...

Flynn's Taxonomy
- # instruction streams x # data streams: Single Instruction Single Data (SISD); Single Instruction Multiple Data (SIMD); Multiple Instruction Single Data (MISD); Multiple Instruction Multiple Data (MIMD)
- Everything is MIMD! However, the question is one of efficiency: how easily (and at what power!) can you do certain operations? The GPU solution from NVIDIA is good at graphics: is it good in general?
- Important: communication architecture: how do processors communicate with one another? How does the programmer build correct programs?
(Lec 2.43) (Lec 2.44)
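The 1-D convolution the systolic array above computes can be checked against a direct sequential implementation; the weights and input stream here are illustrative values, not from the slide:

```python
# Direct check of the systolic array's computation:
# y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
def convolve(w, x):
    taps = len(w)
    return [sum(w[k] * x[i + k] for k in range(taps))
            for i in range(len(x) - taps + 1)]

w = [1, 2, 3, 4]            # weights held in the four systolic cells
x = [1, 0, 2, 0, 1, 3]      # input stream fed through the array
print(convolve(w, x))       # [7, 8, 17]
```

The systolic version produces the same values but with each x entering the array once and moving register-to-register between cells, so the memory is read len(x) times instead of taps*len(y) times; that reduction in memory traffic is the whole point of the structure.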
Any hope for us to do research in multiprocessing?
Yes: FPGAs as New Research Platform
- ~25 CPUs can fit in a Field Programmable Gate Array (FPGA); 1000-CPU system from ~40 FPGAs?
- 64-bit simple "soft core" RISC at 100 MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs; 2X CPUs, 2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box, Massively Parallel Processor that runs standard binaries of OS, apps
- Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, ... (IBM, Sun have donated processors)
- E.g., 1000-processor, IBM Power binary-compatible, cache-coherent system at 200 MHz; fast enough for research
(Lec 2.45)

RAMP
- Since goal is to ramp up research in multiprocessing, called Research Accelerator for Multiple Processors
- To learn more, read "RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform", Technical Report UCB//CSD, Sept 2005
- Web page: ramp.eecs.berkeley.edu
- Project opportunities? Many: infrastructure development for research; validation against simulators/real systems; development of new communication features; etc.
(Lec 2.46)

Why RAMP Good for Research?

                       SMP                    Cluster                Simulate                RAMP
Cost (1000 CPUs)       F (40)                 C (2)                  A+ (0)                  A (0.1)
Cost of ownership      A                      D                      A                       A
Scalability            C                      A                      A                       A
Power/space            D (120 kw, 12 racks)   D (120 kw, 12 racks)   A+ (.1 kw, 0.1 racks)   A (1.5 kw, 0.3 racks)
Community              D                      A                      A                       A
Observability          D                      C                      A                       A
Reproducibility        B                      D                      A                       A
Flexibility            D                      C                      A                       A
Credibility            A+                     A+                     F                       A
Perform. (clock)       A (2 GHz)              A (3 GHz)              F (0 GHz)               C (0.2 GHz)
GPA                    C                      B-                     B                       A-
(Lec 2.47)

RAMP 1 Hardware
- Completed Dec. (14x17 inch 22-layer PCB)
- Module: FPGAs, memory, 10GigE conn.; Compact Flash
- Administration/maintenance ports: » 10/100 Enet » HDMI/DVI » USB
- ~4K/module w/o FPGAs or DRAM
- Called "BEE2" for Berkeley Emulation Engine 2
(Lec 2.48)
RAMP Blue Prototype (1/07)
- 8 MicroBlaze cores / FPGA
- 8 BEE2 modules (32 "user" FPGAs) x 4 FPGAs/module = ... MHz
- Full star-connection between modules
- It works; runs NAS benchmarks
- CPUs are softcore MicroBlazes (32-bit Xilinx RISC architecture)
(Lec 2.49)

Vision: Multiprocessing Watering Hole
- (Figure: RAMP surrounded by: parallel file system, dataflow language/computer, data center in a box, thread scheduling, security enhancements, internet in a box, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages)
- RAMP attracts many communities to shared artifact: cross-disciplinary interactions; accelerate innovation in multiprocessing
- RAMP as next Standard Research Platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)
(Lec 2.50)

Conclusion
- Several major types of communication: Shared Memory, Message Passing, Data-Parallel, Systolic, Dataflow
- Is communication "Turing-complete"? Can simulate each of these on top of the other!
- Many tradeoffs in hardware support
- Communication is a first-class citizen!
  » How to perform communication is essential: IS IT IMPLICIT or EXPLICIT?
  » What to do with communication errors?
  » Does locality matter???
  » How to synchronize?
(Lec 2.51)
Learning Curve for Parallel Applications. 500 Fastest Computers
Learning Curve for arallel Applications ABER molecular dynamics simulation program Starting point was vector code for Cray-1 145 FLO on Cray90, 406 for final version on 128-processor aragon, 891 on 128-processor
More informationEvolution and Convergence of Parallel Architectures
History Evolution and Convergence of arallel Architectures Historically, parallel architectures tied to programming models Divergent architectures, with no predictable pattern of growth. Todd C. owry CS
More informationECE 669 Parallel Computer Architecture
ECE 669 arallel Computer Architecture Lecture 2 Architectural erspective Overview Increasingly attractive Economics, technology, architecture, application demand Increasingly central and mainstream arallelism
More informationParallel Programming Models and Architecture
Parallel Programming Models and Architecture CS 740 September 18, 2013 Seth Goldstein Carnegie Mellon University History Historically, parallel architectures tied to programming models Divergent architectures,
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationNumber of processing elements (PEs). Computing power of each element. Amount of physical memory used. Data access, Communication and Synchronization
Parallel Computer Architecture A parallel computer is a collection of processing elements that cooperate to solve large problems fast Broad issues involved: Resource Allocation: Number of processing elements
More informationThree parallel-programming models
Three parallel-programming models Shared-memory programming is like using a bulletin board where you can communicate with colleagues. essage-passing is like communicating via e-mail or telephone calls.
More informationCPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner
CS104 Computer Organization and rogramming Lecture 20: Superscalar processors, Multiprocessors Robert Wagner Faster and faster rocessors So much to do, so little time... How can we make computers that
More informationParallel Architecture Fundamentals
arallel Architecture Fundamentals Topics CS 740 September 22, 2003 What is arallel Architecture? Why arallel Architecture? Evolution and Convergence of arallel Architectures Fundamental Design Issues What
More informationNOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability
Recap: Gigaplane Bus Timing 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Address Rd A Rd B Scalability State Arbitration 1 4,5 2 Share ~Own 6 Own 7 A D A D A D A D A D A D A D A D CS 258, Spring 99 David E. Culler
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationConvergence of Parallel Architectures
History Historically, parallel architectures tied to programming models Divergent architectures, with no predictable pattern of growth. Systolic Arrays Dataflow Application Software System Software Architecture
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationCS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Introduction (Chapter 1)
CS/ECE 757: Advanced Computer Architecture II (arallel Computer Architecture) Introduction (Chapter 1) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived from work by Sarita
More informationConventional Computer Architecture. Abstraction
Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction
More informationScalable Multiprocessors
arallel Computer Organization and Design : Lecture 7 er Stenström. 2008, Sally A. ckee 2009 Scalable ultiprocessors What is a scalable design? (7.1) Realizing programming models (7.2) Scalable communication
More informationLimitations of Memory System Performance
Slides taken from arallel Computing latforms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar! " To accompany the text ``Introduction to arallel Computing'', Addison Wesley, 2003. Limitations
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationLecture 28 Introduction to Parallel Processing and some Architectural Ramifications. Flynn s Taxonomy. Multiprocessing.
1 2 Lecture 28 Introduction to arallel rocessing and some Architectural Ramifications 3 4 ultiprocessing Flynn s Taxonomy Flynn s Taxonomy of arallel achines How many Instruction streams? How many Data
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationChapter 2: Computer-System Structures. Hmm this looks like a Computer System?
Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationUniprocessor Computer Architecture Example: Cray T3E
Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationDr. Joe Zhang PDC-3: Parallel Platforms
CSC630/CSC730: arallel & Distributed Computing arallel Computing latforms Chapter 2 (2.3) 1 Content Communication models of Logical organization (a programmer s view) Control structure Communication model
More informationLecture 17: Parallel Architectures and Future Computer Architectures. Shared-Memory Multiprocessors
Lecture 17: arallel Architectures and Future Computer Architectures rof. Kunle Olukotun EE 282h Fall 98/99 1 Shared-emory ultiprocessors Several processors share one address space» conceptually a shared
More informationSpring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University
18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?
More informationProcessor Architecture and Interconnect
Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationParallel Computer Architecture
arallel Computer Architecture CS 472 Concurrent & arallel rogramming University of Evansville Selection of slides from CIS 410/510 Introduction to arallel Computing Department of Computer and Information
More informationHistory of Distributed Systems. Joseph Cordina
History of Distributed Systems Joseph Cordina joseph.cordina@um.edu.mt otivation Computation demands were always higher than technological status quo Obvious answer Several computing elements working in
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationPlatforms Design Challenges with many cores
latforms Design hallenges with many cores Raj Yavatkar, Intel Fellow Director, Systems Technology Lab orporate Technology Group 1 Environmental Trends: ell 2 *Other names and brands may be claimed as the
More informationEE382 Processor Design. Illinois
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors Part II EE 382 Processor Design Winter 98/99 Michael Flynn 1 Illinois EE 382 Processor Design Winter 98/99 Michael Flynn 2 1 Write-invalidate
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationCMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3
MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance
SMP and ccNUMA Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5, September 14, 2007. SMP and ccNUMA Multiprocessor Systems. Professor Kai Hwang, USC Internet and Grid Computing Laboratory. Email: kaihwang@usc.edu [1]
Multiprocessor Interconnection Networks
Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical
Scalable Distributed Memory Machines
Scalable Distributed Memory Machines Goal: Parallel machines that can be scaled to hundreds or thousands of processors. Design Choices: Custom-designed or commodity nodes? Network scalability. Capability
PARALLEL COMPUTER ARCHITECTURES
8 PARALLEL COMPUTER ARCHITECTURES. CPU, Shared memory. Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different
Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions:
Scalable Cache Coherent Systems. Scalable distributed shared memory machines. Assumptions: Processor-cache-memory nodes connected by scalable network. Distributed shared physical address space. Communication assist
Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors
Multiprocessors and Multithreading: why would you want a multiprocessor? More is better? Classifying Multiprocessors: Flynn Taxonomy. Interconnection Network
NOW Handout Page 1. Recap: Performance Trade-offs. Shared Memory Multiprocessors. Uniprocessor View. Recap (cont) What is a Multiprocessor?
Recap: Performance Trade-offs. Shared Memory Multiprocessors, CS 258, Spring 99, David E. Culler, Computer Science Division, U.C. Berkeley. Programmer's View of Performance: Speedup < Sequential Work / Max (Work + Synch
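The truncated inequality in this preview is, in the Culler/Singh formulation, Speedup(p) ≤ Sequential Work / max over processors of (Work + Synch Wait + Comm Cost): the slowest processor bounds parallel execution time. A minimal sketch of evaluating that bound; the per-processor cost numbers are made up for illustration:

```python
# Sketch of the speedup upper bound from the preview's formula.
# The workload numbers below are hypothetical, not from the source.

def speedup_bound(sequential_work, per_proc_costs):
    """per_proc_costs: (work + synch wait + comm cost) for each processor.
    Parallel time is set by the most heavily loaded processor."""
    return sequential_work / max(per_proc_costs)

# 100 units of sequential work split over 4 processors, unevenly,
# with synchronization/communication overhead folded into each cost.
costs = [30, 28, 27, 32]
print(speedup_bound(100, costs))  # bounded by the 32-unit processor: 3.125
```

Note how load imbalance alone (32 vs. 25 units in a perfect split) already keeps the bound well below the processor count of 4.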
Multiprocessors - Flynn's Taxonomy (1966)
Multiprocessors - Flynn's Taxonomy (1966): Single Instruction stream, Single Data stream (SISD). Conventional uniprocessor, although ILP is exploited. Single Program Counter -> Single Instruction stream. The
Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn's Taxonomy (1972). Why Multiprocessors?
Parallel Computers. CPE 631 Session 20: Multiprocessors. Department of Electrical and Computer Engineering, University of Alabama in Huntsville. Definition: A parallel computer is a collection of processing
Outline. Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
A Scalable SAS Machine
Parallel Computer Organization and Design: Lecture 8. Per Stenström 2008, Sally A. McKee 2009. Scalable Cache Coherence: design principles of scalable cache protocols. Overview of design space (8.1). Basic operation
Handout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
Computing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures, Part 2. TMA4280 Introduction to Supercomputing. NTNU, IMF, January 16, 2017. Supercomputing: What is the motivation for Supercomputing? Solve complex problems fast and accurately:
Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures
Approaches to Building Parallel Machines: Switch/Bus, Scale. Shared Memory Architectures: (interleaved) first-level cache, (interleaved) main memory. Arvind Krishnamurthy, Fall 2004. Shared Cache
Cache Coherence in Scalable Machines
Cache Coherence in Scalable Machines. CSE 661 Parallel and Vector Architectures. Prof. Muhamed Mudawar, Computer Engineering Department, King Fahd University of Petroleum and Minerals. Generic Scalable Multiprocessor
Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture. Parallel Computing. Topics: 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl's Law 5. Flynn's Taxonomy of Parallel Computers 6.
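The Amdahl's Law item in the topic list can be made concrete with a short sketch; the 90%-parallel fraction and processor counts below are assumed example values, not taken from any of these course materials:

```python
# Sketch of Amdahl's Law: overall speedup is limited by the serial fraction.
# The 0.9 parallel fraction is an illustrative assumption.

def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup = 1 / (serial + parallel/n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# With 16 processors: 1 / (0.1 + 0.9/16) = 6.4
print(round(amdahl_speedup(0.9, 16), 2))
# Even with an enormous processor count, a 10% serial fraction
# caps speedup near 10x.
print(round(amdahl_speedup(0.9, 10**9), 2))
```

This is the standard argument for why the serial fraction, not the processor count, dominates at scale.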
Interconnection Network
Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Topics
Interconnection Network. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Topics Taxonomy Metric Topologies Characteristics Cost Performance 2 Interconnection
CCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM
CCS HPC. taisuke@cs.tsukuba.ac.jp. CPU, memory, I/O; single-chip multi-core CPU. PC, MPP (Massively Parallel Processor), e.g. IBM BlueGene/L (65536 nodes). Interconnection Network. (distributed memory system) (shared
Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
Chapter 1: Perspectives
Chapter 1: Perspectives. Copyright © 2005-2008 Yan Solihin. Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical,
What is a parallel computer?
7.5 credit points. Instructor: Sally A. McKee. [Figure: IBM SP-2 node: Power 2 CPU, L2 cache, memory bus, 4-way interleaved DRAM with memory controller, MicroChannel; nodes joined by a general interconnection network formed from 8-port switches.]
Intro to Multiprocessors
The Big Picture: Where are We Now? Intro to Multiprocessors. [Figure: processor datapaths with inputs and outputs; adapted from Computer Organization and Design, Patterson & Hennessy, 2005.] Multiprocessor: multiple
[7.2.5] Certain challenges arise in realizing SAS or message-passing programming models. Two of these are input-buffer overflow and fetch deadlock.
Buffering Problems [7.2.5]: Certain challenges arise in realizing SAS or message-passing programming models. Two of these are input-buffer overflow and fetch deadlock. Input-buffer overflow: Suppose a large
Parallel Architectures
Parallel Architectures. Instructor: Tsung-Che Chiang, tcchiang@ieee.org, Department of Computer Science and Information Engineering, National Taiwan Normal University. Introduction: In the roughly three decades between
CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks, March 9th, 2011
CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
A non-uniform memory access machine (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
NOW Handout Page 1. *T: Network Co-Processor. General Purpose Node-to-Network Interface in Scalable Multiprocessors. Dedicated Message Processor
*T: Network Co-Processor. General Purpose Node-to-Network Interface in Scalable Multiprocessors. CS 258, Spring 99, David E. Culler, Computer Science Division, U.C. Berkeley. iWarp: Systolic Computation. Host Interface unit. Dedicated
ECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 27 Course Wrap Up What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448
Multiprocessors and Thread Level Parallelism, Chapter 4, Appendix H, CS448. The Greed for Speed: Two general approaches to making computers faster. Faster uniprocessor: all the techniques we've been looking
Parallel Computer Architecture
Parallel Computer Architecture What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some broad issues: Resource Allocation:»
Computer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
Lecture 1: Introduction
Lecture 1: Introduction. Course organization: 13 lectures on parallel architectures; ~5 lectures on cache coherence, consistency; ~3 lectures on TM; ~2 lectures on interconnection networks; ~2 lectures on large
Parallel Architecture. Hwansoo Han
Parallel Architecture. Hwansoo Han. Performance Curve. Unicore Limitations: performance scaling stopped due to power, wire delay, DRAM latency, limitation in ILP. Power Consumption (watts). Wire Delay Range
Interconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
The Sun Fireplane Interconnect in the Mid-Range Sun Fire Servers
TAKE IT TO THE NTH. Alan Charlesworth, Sun Microsystems. The Sun Fireplane Interconnect in the Mid-Range Sun Fire Servers. Vertical & Horizontal Scaling: many CPUs in one box. Cache-coherent shared memory (SMP). Usually proprietary
EE382 Processor Design. Processor Issues for MP
EE382 Processor Design, Winter 1998. Chapter 8 Lectures: Multiprocessors, Part I. EE 382 Processor Design Winter 98/99, Michael Flynn. Processor Issues for MP: Initialization, Interrupts, Virtual Memory, TLB Coherency
Outline. Limited Scaling of a Bus
Outline: Scalability (physical, bandwidth, latency and cost; level of integration). Realizing Programming Models: network transactions, protocols, safety; input buffer problem: N-1; fetch deadlock. Communication Architecture
COSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I). Edgar Gabriel, Spring 2014. Long-term trend on the number of transistors per integrated circuit: the number of transistors doubles every ~18 months
10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems. To enhance system performance and, in some cases, to increase
Scalable Multiprocessors
Scalable Multiprocessors [11.1]: A scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. As the size of the system
Computer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
Message Passing Models and Multicomputer distributed system LECTURE 7
Message Passing Models and Multicomputer Distributed Systems, Lecture 7. Dr. Samman H. Ameen. [Figure: nodes connected by a message-passing direct network interconnection.]
WHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
Parallel Arch. Review
Parallel Arch. Review Zeljko Zilic McConnell Engineering Building Room 536 Main Points Understanding of the design and engineering of modern parallel computers Technology forces Fundamental architectural
CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review
CSE502 Graduate Computer Architecture Lec 22 Goodbye to Computer Architecture and Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from
Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism. Multithreading: Increasing performance by ILP has the great advantage that it is reasonably transparent to the programmer, but ILP can be quite limited or hard to
Lecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
Parallel Processing. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture: Parallel Processing. Prof. Dr. Nizamettin AYDIN, naydin@yildiz.edu.tr, nizamettinaydin@gmail.com. http://www.yildiz.edu.tr/~naydin. Outline: Multiple Processor
COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
COSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems. Fall 2006. Classification of Parallel Architectures: Flynn's Taxonomy. SISD: single instruction single data; classical von Neumann architecture. SIMD:
Lecture notes for CS Chapter 4 11/27/18
Chapter 5: Thread-Level Parallelism, Part 1. Introduction: What is a parallel or multiprocessor system? Why parallel architecture? Performance potential. Flynn classification. Communication models. Architectures
Cache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
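The "Coherent Memory System: Intuition" item above names the classic staleness problem: two processors cache the same location, one writes it, and the other must not keep reading the old value. A toy write-through, write-invalidate sketch; the Cache class and its API are illustrative assumptions, not taken from any of these course materials:

```python
# Toy illustration of the cache coherence problem and a write-invalidate fix.
# Everything here (class names, write-through policy) is an assumed sketch.

memory = {"x": 0}  # shared main memory

class Cache:
    def __init__(self):
        self.lines = {}                     # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]             # hit: serve the cached copy

    def write(self, addr, value, others=()):
        self.lines[addr] = value
        memory[addr] = value                # write-through for simplicity
        for cache in others:                # write-invalidate: drop stale copies
            cache.lines.pop(addr, None)

p0, p1 = Cache(), Cache()
p0.read("x"); p1.read("x")      # both caches now hold x = 0
p0.write("x", 1, others=[p1])   # p0 writes; p1's copy is invalidated
print(p1.read("x"))             # p1 misses and refetches: prints 1
```

Without the invalidation step, p1's read would still hit its cached 0, which is exactly the incoherence the formal definition rules out.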