CS 258 Parallel Computer Architecture
Lecture 2: Convergence of Parallel Architectures (January 28, 2008, Prof. John D. Kubiatowicz)


Review (Lec 2.2)
- Industry has decided that Multiprocessing is the future/best use of transistors
  - Every major chip manufacturer now making MultiCore chips
- History of microprocessor architecture is parallelism
  - translates area and density into performance
- The Future is higher levels of parallelism
  - Parallel Architecture concepts apply at many levels
  - Communication also on exponential curve
- Proper way to compute speedup (written out in the equation block after this page)
  - Incorrect way to measure: compare parallel program on 1 processor to parallel program on p processors
  - Instead: should compare uniprocessor program on 1 processor to parallel program on p processors

History (Lec 2.3)
- Parallel architectures tied closely to programming models
  - Divergent architectures, with no predictable pattern of growth
  - Mid-80s renaissance
- (Diagram: Application Software and System Software layered over Architecture, which spans Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory)

Plan for Today (Lec 2.4)
- Look at major programming models
  - where did they come from?
  - The 80s architectural renaissance!
  - What do they provide?
  - How have they converged?
- Extract general structure and fundamental issues
- (Diagram: Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory converging on a Generic Architecture)
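The Review slide's speedup rule, put in equation form (a sketch; the notation T_seq and T_par is mine, not the slides'): the baseline must be the best uniprocessor program, not the parallel program run on one processor.

```latex
\mathrm{Speedup}(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(p)}
  \quad\text{(correct)}
\qquad\qquad
\mathrm{Speedup}(p) = \frac{T_{\mathrm{par}}(1)}{T_{\mathrm{par}}(p)}
  \quad\text{(misleading)}
```

Because the parallel program carries parallelization overhead even on one processor, T_par(1) >= T_seq, so the second ratio overstates the benefit of adding processors.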

Programming Model (Lec 2.5)
- Conceptualization of the machine that programmer uses in coding applications
  - How parts cooperate and coordinate their activities
  - Specifies communication and synchronization operations
- (Diagram: four processors around a Shared Addr. Space / Shared Memory)
- Multiprogramming: no communication or synch. at program level
- Shared address space: like bulletin board
- Message passing: like letters or phone calls, explicit point to point
- Data parallel: more regimented, global actions on data
  - Implemented with shared address space or message passing

Shared Memory (Lec 2.6)
- Range of addresses shared by all processors
- All communication is implicit (through memory)
  - Want to communicate a bunch of info? Pass pointer.
- Programming is straightforward (a small sketch follows after this page)
  - Generalization of multithreaded programming

Historical Development (Lec 2.7)
- "Mainframe" approach
  - Motivated by multiprogramming
  - Extends crossbar used for Mem and I/O
  - Processor cost-limited => crossbar
  - Bandwidth scales with p
  - High incremental cost: use multistage instead
  - (Diagram: memory modules and I/O controllers joined to processors by the interconnect)
- "Minicomputer" approach
  - Almost all microprocessor systems have bus
  - Motivated by multiprogramming, TP
  - Used heavily for parallel computing
  - Called symmetric multiprocessor (SMP)
  - Latency larger than for uniprocessor
  - Bus is bandwidth bottleneck: caching is key => coherence problem
  - Low incremental cost

Adding Processing Capacity (Lec 2.8)
- Memory capacity increased by adding modules, I/O by controllers and devices
- Add processors for processing!
  - For higher-throughput multiprogramming, or parallel programs
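The "communication is implicit" point on the Shared Memory slide, made concrete with a minimal POSIX-threads sketch (my example, not the lecture's): one thread stores to an ordinary variable, another loads it, and a lock stands in for the special synchronization operations the slides mention.

```c
/* Minimal shared-address-space sketch: communication happens through
 * ordinary loads/stores to memory both threads can see.
 * Compile with: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static int shared_data;   /* one range of addresses, visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {
    pthread_mutex_lock(&lock);    /* special operation for synchronization */
    shared_data = 42;             /* an ordinary store communicates        */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    printf("consumer read %d\n", shared_data);   /* an ordinary load */
    return 0;
}
```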

Shared Physical Memory (Lec 2.9)
- Any processor can directly reference any location
  - Communication operation is load/store
  - Special operations for synchronization
- Any I/O controller => any memory
- Operating system can run on any processor, or all
  - OS uses shared memory to coordinate
- What about application processes?

Shared Virtual Address Space (Lec 2.10)
- Process = address space plus thread of control
- Virtual-to-physical mapping can be established so that processes share portions of address space
  - User-kernel or multiple processes
- Multiple threads of control on one address space
  - Popular approach to structuring OS's
  - Now standard application capability (ex: POSIX threads)
- Writes to shared address visible to other threads
  - Natural extension of the uniprocessor model
  - conventional memory operations for communication
  - special atomic operations for synchronization (also load/stores)

Structured Shared Address Space (Lec 2.11)
- (Diagram: virtual address spaces for a collection of processes communicating via shared addresses; the shared portion of each address space maps by Load/Store to common physical addresses in the machine physical address space, while the private portions, Pn private ... P2 private, P1 private, P0 private, stay disjoint)
- Ad hoc parallelism used in system code
- Most parallel applications have structured SAS
- Same program on each processor
  - shared variable X means the same thing to each thread

A Cache Coherence Problem (Lec 2.12)
- (Diagram: P0 stores a new value, write-through?, while P1 and P2 hold cached copies and each ask "R?": old or new value?)
- Caches are aliases for memory locations
- Does every processor eventually see new value?
  - Tightly related: Cache Consistency: in what order do writes appear to other processors?
- Buses make this easy: every processor can snoop on every write
  - Essential feature: Broadcast
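The coherence slide's two questions, visibility (does every processor eventually see the new value?) and ordering (in what order do writes appear?), surface directly in software. A hedged C11 sketch, my example rather than the lecture's: release/acquire ordering guarantees that a reader who observes the flag also observes the data written before it.

```c
/* Write visibility and ordering, per the cache-coherence slide.
 * The writer publishes data, then raises a flag with release ordering;
 * a reader that sees flag == 1 (acquire) is guaranteed to see data == 7
 * rather than a stale value. Compile with: cc demo.c -lpthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;          /* payload                 */
static atomic_int flag;   /* "data is ready" signal  */

static void *writer(void *arg) {
    data = 7;                                              /* first write  */
    atomic_store_explicit(&flag, 1, memory_order_release); /* second write */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                  /* spin until the flag is seen */
    printf("data = %d\n", data);           /* prints 7, never stale       */
    pthread_join(t, NULL);
    return 0;
}
```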

Engineering: Intel Pentium Pro Quad (Lec 2.13)
- (Diagram: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); PCI bridges to PCI buses and I/O cards; memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM)
- All coherence and multiprocessing glue in processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth

Engineering: SUN Enterprise (Lec 2.14)
- (Diagram: CPU/mem cards (2 CPUs with caches, mem ctrl) and I/O cards (bus interface, SBUS slots, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus (256 data, 41 address, 83 MHz) via bus interface/switch)
- Proc + mem card vs. I/O card
  - 16 cards of either type
  - All memory accessed over bus, so symmetric
  - Higher bandwidth, higher latency bus

Quad-Processor Xeon Architecture (Lec 2.15)
- All sharing through pairs of front side busses (FSB)
  - Memory traffic/cache misses through single chipset to memory
  - Example: Blackford chipset

Scaling Up (Lec 2.16)
- (Diagram: "General" Omega network, "Dance hall" organization, and distributed memory)
- Problem is interconnect: cost (crossbar) or bandwidth (bus)
- Dance-hall: bandwidth still scalable, but lower cost than crossbar
  - latencies to memory uniform, but uniformly large
- Distributed memory or non-uniform memory access (NUMA)
  - Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
- Caching shared (particularly nonlocal) data?

Stanford DASH (Lec 2.17)
- Clusters of 4 processors share 2nd-level cache
- Up to 16 clusters tied together with 2-dim mesh
- 16-bit directory associated with every memory line
  - Each memory line has home cluster that contains DRAM
  - The 16-bit vector says which clusters (if any) have read copies
  - Only one writer permitted at a time
- Never got more than 12 clusters (48 processors) working at one time: synchronous network probs!
- (Diagram: four processors with L1 caches sharing one L2 cache)

The MIT Alewife Multiprocessor (Lec 2.18)
- Cache-coherent shared memory, partially in software!
  - Limited directory + software overflow
- User-level message-passing
- Rapid context-switching
- 2-dimensional synchronous network
- One node/board
- Got 32 processors (+ I/O boards) working

Engineering: Cray T3E (Lec 2.19)
- (Diagram: node with processor, Mem, Mem ctrl and NI, External I/O, and switch dimensions XY and Z)
- Scale up to 1024 processors, 480 MB/s links
- Memory controller generates request message for non-local references
- No hardware mechanism for coherence
  - SGI Origin etc. provide this

AMD Direct Connect (Lec 2.20)
- Communication over general interconnect
  - Shared memory/address space traffic over network
  - I/O traffic to memory over network
- Multiple topology options (seems to scale to 8 or 16 processor chips)

What is underlying Shared Memory?? (Lec 2.21)
- (Diagram: Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory over a Generic Architecture)
- Packet-switched networks better utilize available link bandwidth than circuit-switched networks
- So, network passes messages around!

Message Passing Architectures (Lec 2.22)
- Complete computer as building block, including I/O
  - Communication via explicit I/O operations
- Programming model
  - direct access only to private address space (local memory)
  - communication via explicit messages (send/receive)
- High-level block diagram
  - Communication integration? Mem, I/O, LAN, Cluster
  - Easier to build and scale than SAS
- Programming model more removed from basic hardware operations
  - Library or OS intervention

Message-Passing Abstraction (Lec 2.23) (sketched in MPI after this page)
- (Diagram: Process P does Send X, Q, t; Process Q does Receive Y, P, t; Address X and Address Y live in the two local process address spaces; the match on (process, tag) pairs the transfer)
- Send specifies buffer to be transmitted and receiving process
- Recv specifies sending process and application storage to receive into
- Memory-to-memory copy, but need to name processes
- Optional tag on send and matching rule on receive
- User process names local data and entities in process/tag space too
- In simplest form, the send/recv match achieves pairwise synch event
  - Other variants too
- Many overheads: copying, buffer management, protection

Evolution of Message-Passing Machines (Lec 2.24)
- Early machines: FIFO on each link
  - HW close to programming model; synchronous ops
  - topology central (hypercube algorithms)
- CalTech Cosmic Cube (Seitz, CACM Jan 85)
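The Message-Passing Abstraction slide maps almost one-for-one onto MPI. A minimal sketch (MPI is my choice of library here; the slide describes the abstraction, not this API): send names (buffer, destination, tag), receive names (buffer, source, tag), and the (process, tag) match completes the memory-to-memory copy.

```c
/* "Send X, Q, t" / "Receive Y, P, t" from the abstraction slide, in MPI.
 * Build and run with: mpicc demo.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {            /* process P: send local X to Q with tag t */
        x = 42;
        MPI_Send(&x, 1, MPI_INT, /*dest Q*/ 1, /*tag t*/ 7, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* process Q: receive from P into local Y  */
        MPI_Recv(&x, 1, MPI_INT, /*src P*/ 0, /*tag t*/ 7, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Q received %d\n", x);  /* the match is also a pairwise synch */
    }
    MPI_Finalize();
    return 0;
}
```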

MIT J-Machine (Jelly-bean machine) (Lec 2.25)
- 3-dimensional network topology
  - Non-adaptive, e-cube routing
  - Hardware routing
  - Maximize density of communication
- 64 nodes/board, 1024 nodes total
- Low-powered processors
  - Message passing instructions
  - Associative array primitives to aid in synthesizing shared-address space
- Extremely fine-grained communication
  - Hardware-supported Active Messages

Diminishing Role of Topology? (Lec 2.26)
- Shift to general links
  - DMA, enabling non-blocking ops
    - Buffered by system at destination until recv
  - Store & forward routing
- Fault-tolerant, multi-path routing: diminishing role of topology
  - Any-to-any pipelined routing
  - node-network interface dominates communication time
    - fast relative to overhead
    - Will this change for ManyCore?
  - Simplifies programming
  - Allows richer design space
    - grids vs hypercubes
- Intel iPSC/1 -> iPSC/2 -> iPSC/860

Example: Intel Paragon (Lec 2.27)
- (Diagram: Sandia's Intel Paragon XP/S-based supercomputer; each node has two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, Mem ctrl to 4-way interleaved DRAM, plus driver, DMA, and NI; nodes attach to every switch of a 2D grid network with links 8 bits wide, 175 MHz, bidirectional)

Building on the mainstream: IBM SP-2 (Lec 2.28)
- Made out of essentially complete RS6000 workstations
- Network interface integrated in I/O bus (bw limited by bus)
- General interconnection network formed from 8-port switches
- (Diagram: IBM SP-2 node with Power 2 CPU, L2, memory bus, memory controller, and 4-way interleaved DRAM; the NIC, with its own i860, DMA, and DRAM, sits on the MicroChannel bus)

Berkeley NOW (Lec 2.29)
- 100 Sun Ultra2 workstations
- Intelligent network interface
  - proc + mem
- Myrinet network
  - 160 MB/s per link
  - 300 ns per hop

Data Parallel Systems (Lec 2.30)
- Programming model
  - Operations performed in parallel on each element of data structure
  - Logically single thread of control, performs sequential or parallel steps
  - Conceptually, a processor associated with each data element
- Architectural model
  - Array of many simple, cheap processors with little memory each
    - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - Centralize high cost of instruction fetch/sequencing
- (Diagram: control processor broadcasting to a grid of PEs)

Application of Data Parallelism (Lec 2.31) (sketched in code after this page)
- Each PE contains an employee record with his/her salary
- If salary > 100K then salary = salary * 1.05 else salary = salary * 1.10
- Logically, the whole operation is a single step
- Some processors enabled for arithmetic operation, others disabled
- Other examples:
  - Finite differences, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Some recent machines:
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2

Connection Machine (Lec 2.32)
- (Photo of the Connection Machine; Tucker, IEEE Computer, Aug. 1988)
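The salary example from the Application of Data Parallelism slide, written as a loop (a sketch; the slide assumes one PE per record, while this C version walks the array on one processor): the if/else is the per-element enable/disable the slide describes, which a SIMD machine implements with an execution mask rather than a branch.

```c
/* One logical data-parallel step: every element's salary is adjusted
 * "simultaneously"; the condition selects which PEs are enabled. */
#include <stddef.h>

void adjust_salaries(float *salary, size_t n) {
    for (size_t i = 0; i < n; i++) {     /* conceptually: all i at once  */
        if (salary[i] > 100000.0f)
            salary[i] *= 1.05f;          /* PEs enabled for this arm     */
        else
            salary[i] *= 1.10f;          /* remaining PEs enabled here   */
    }
}
```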

NVidia Tesla Architecture (Lec 2.33)
- Combined GPU and general CPU

Components of NVidia Tesla architecture (Lec 2.34)
- SM has 8 SP thread processor cores
  - 32 GFLOPS peak at 1.35 GHz
  - IEEE 754 32-bit floating point
  - 32-bit, 64-bit integer
  - 2 SFU special function units
- Scalar ISA
  - Memory load/store/atomic
  - Texture fetch
  - Branch, call, return
  - Barrier synchronization instruction
- Multithreaded Instruction Unit
  - 768 independent threads per SM
  - HW multithreading & scheduling
- 16KB Shared Memory
  - Concurrent threads share data
  - Low latency load/store
- Full GPU: total performance > 500 GOps

CM-5 (Lec 2.35)
- Repackaged SparcStation
  - 4 per board
- Fat-Tree network
- Control network for global synchronization

Evolution and Convergence (Lec 2.36)
- SIMD popular when cost savings of centralized sequencer high
  - 60s when CPU was a cabinet
  - Replaced by vectors in mid-70s
    - More flexible w.r.t. memory layout and easier to manage
  - Revived in mid-80s when 32-bit datapath slices just fit on chip
- Simple, regular applications have good locality
- Programming model converges with SPMD (single program multiple data)
  - need fast global synchronization
  - Structured global address space, implemented with either SAS or MP

(Roadmap diagram: Systolic Arrays, Dataflow, SIMD, Message Passing, Shared Memory over a Generic Architecture) (Lec 2.37)

Dataflow Architectures (Lec 2.38) (the example graph is traced in code after this page)
- Represent computation as a graph of essential dependences
  - Logical processor at each node, activated by availability of operands
  - Message (tokens) carrying tag of next instruction sent to next processor
  - Tag compared with others in matching store; match fires execution
- Example dataflow graph: a = (b + 1) * (b - c); d = c * e; f = a * d
- (Diagram: token store, waiting-matching, program store, instruction fetch, execute, token queue, form token; Monsoon (MIT))

Evolution and Convergence (Lec 2.39)
- Key characteristics
  - Ability to name operations, synchronization, dynamic scheduling
- Problems
  - Operations have locality across them, useful to group together
  - Handling complex data structures like arrays
  - Complexity of matching store and memory units
  - Expose too much parallelism (?)
- Converged to use conventional processors and memory
  - Support for large, dynamic set of threads to map to processors
  - Typically shared address space as well
  - But separation of programming model from hardware (like data-parallel)
- Lasting contributions:
  - Integration of communication with thread (handler) generation
  - Tightly integrated communication and fine-grained synchronization
  - Remained useful concept for software (compilers etc.)

Systolic Architectures (Lec 2.40)
- VLSI enables inexpensive special-purpose chips
  - Represent algorithms directly by chips connected in regular pattern
  - Replace single processor with array of regular processing elements
  - Orchestrate data flow for high throughput with less memory access
- Different from pipelining
  - Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
- SIMD? : each PE may do something different
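The dataflow graph from the Dataflow Architectures slide, traced as a firing schedule (my sequential sketch; a real dataflow machine fires any node whose operand tokens have all arrived, so the independent products could execute concurrently):

```c
/* a = (b + 1) * (b - c); d = c * e; f = a * d, in firing order. */
#include <stdio.h>

int main(void) {
    float b = 2, c = 3, e = 4;
    float t1 = b + 1;    /* fires: token b available                   */
    float t2 = b - c;    /* fires: tokens b, c available (independent) */
    float a  = t1 * t2;  /* fires when t1 and t2 arrive                */
    float d  = c * e;    /* independent of a: could fire in parallel   */
    float f  = a * d;    /* final node consumes a and d                */
    printf("f = %g\n", f);   /* (2+1)*(2-3) = -3; -3 * 12 = -36        */
    return 0;
}
```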

Systolic Arrays (contd.) (Lec 2.41)
- Example: systolic array for 1-D convolution (a software rendering follows after this page)
  - y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
  - (Diagram: inputs x8, x7, x6, x5, x4, x3, x2, x1 stream through cells holding weights w4, w3, w2, w1; outputs y3, y2, y1 emerge; each cell computes xout = x, x = xin, yout = yin + w*xin)
- Practical realizations (e.g. iWarp) use quite general processors
  - Enable variety of algorithms on same hardware
- But dedicated interconnect channels
  - Data transfer directly from register to register across channel
- Specialized, and same problems as SIMD
  - General purpose systems work well for same algorithms (locality etc.)

Toward Architectural Convergence (Lec 2.42)
- Evolution and role of software have blurred boundary
  - Send/recv supported on SAS machines via buffers
  - Can construct global address space on MP (GA -> P | LA)
  - Page-based (or finer-grained) shared virtual memory
- Hardware organization converging too
  - Tighter NI integration even for MP (low-latency, high-bandwidth)
  - Hardware SAS passes messages
- Even clusters of workstations/SMPs are parallel systems
  - Emergence of fast system area networks (SAN)
- Programming models distinct, but organizations converging
  - Nodes connected by general network and communication assists
  - Implementations also converging, at least in high-end machines
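The cell equations on the convolution slide (xout = x, x = xin, yout = yin + w*xin) accumulate the same result as this direct loop (a software sketch of the computation, not of the systolic schedule itself):

```c
/* 1-D convolution from the slide: y(i) = w1*x(i) + w2*x(i+1)
 * + w3*x(i+2) + w4*x(i+3); TAPS = 4 weights, one per systolic cell. */
#define TAPS 4

void conv1d(const float x[], int n, const float w[TAPS], float y[]) {
    for (int i = 0; i + TAPS <= n; i++) {    /* one output per alignment */
        float acc = 0.0f;
        for (int k = 0; k < TAPS; k++)
            acc += w[k] * x[i + k];   /* yout = yin + w * xin, unrolled  */
        y[i] = acc;
    }
}
```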

Convergence: Generic Parallel Architecture (Lec 2.43)
- (Diagram: each node = processor(s) P, cache, Mem, and communication assist (CA), attached to a scalable network)
- Node: processor(s), memory system, plus communication assist
  - Network interface and communication controller
- Scalable network
- Convergence allows lots of innovation, within framework
  - Integration of assist with node, what operations, how efficiently...

Flynn's Taxonomy (Lec 2.44)
- # instruction streams x # data streams
  - Single Instruction Single Data (SISD)
  - Single Instruction Multiple Data (SIMD)
  - Multiple Instruction Single Data (MISD)
  - Multiple Instruction Multiple Data (MIMD)
- Everything is MIMD!
- However, question is one of efficiency
  - How easily (and at what power!) can you do certain operations?
  - GPU solution from NVIDIA: good at graphics; is it good in general?
- As (more?) important: communication architecture
  - How do processors communicate with one another?
  - How does the programmer build correct programs?

Any hope for us to do research in multiprocessing? (Lec 2.45)
- Yes: FPGAs as new research platform
- As ~25 CPUs can fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ~40 FPGAs?
  - 64-bit simple "soft core" RISC at 100 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs; 2X CPUs, 2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box, Massively Parallel Processor that runs standard binaries of OS, apps
  - Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, ... (IBM, Sun have donated processors)
  - E.g., 1000-processor, IBM Power binary-compatible, cache-coherent system at 200 MHz; fast enough for research

RAMP (Lec 2.46)
- Since goal is to ramp up research in multiprocessing, called Research Accelerator for Multiple Processors
  - To learn more, read "RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform", Technical Report UCB//CSD, Sept 2005
  - Web page: ramp.eecs.berkeley.edu
- Project opportunities? Many
  - Infrastructure development for research
  - Validation against simulators/real systems
  - Development of new communication features
  - Etc.

Why RAMP Good for Research? (Lec 2.47)

                           SMP                   Cluster               Simulate               RAMP
  Cost (1000 CPUs)         F ($40M)              C ($2M)               A+ ($0M)               A ($0.1M)
  Cost of ownership        A                     D                     A                      A
  Scalability              C                     A                     A                      A
  Power/space
  (kilowatts, racks)       D (120 kw, 12 racks)  D (120 kw, 12 racks)  A+ (.1 kw, 0.1 racks)  A (1.5 kw, 0.3 racks)
  Community                D                     A                     A                      A
  Observability            D                     C                     A+                     A+
  Reproducibility          B                     D                     A+                     A+
  Flexibility              D                     C                     A+                     A+
  Credibility              A+                    A+                    F                      B+/A-
  Perform. (clock)         A (2 GHz)             A (3 GHz)             F (0 GHz)              C (0.2 GHz)
  GPA                      C                     C-                    B                      A-

RAMP 1 Hardware (Lec 2.48)
- Completed Dec. 2004 (14x17 inch 22-layer PCB)
- Module:
  - FPGAs, memory, 10GigE conn.
  - Compact Flash
  - Administration/maintenance ports: 10/100 Enet, HDMI/DVI, USB
  - ~$4K/module w/o FPGAs or DRAM
- Called "BEE2" for Berkeley Emulation Engine 2

RAMP Blue Prototype (1/07) (Lec 2.49)
- 8 MicroBlaze cores / FPGA
- 8 BEE2 modules (32 user FPGAs) x 4 FPGAs/module = 256 cores @ 100 MHz
- Full star-connection between modules
- It works; runs NAS benchmarks
- CPUs are soft-core MicroBlazes (32-bit Xilinx RISC architecture)

Vision: Multiprocessing Watering Hole (Lec 2.50)
- RAMP attracts many communities to a shared artifact: parallel file system, dataflow language/computer, data center in a box, thread scheduling, security enhancements, internet in a box, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages
- Cross-disciplinary interactions
- Accelerate innovation in multiprocessing
- RAMP as next standard research platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)

Conclusion (Lec 2.51)
- Several major types of communication:
  - Shared Memory
  - Message Passing
  - Data-Parallel
  - Systolic
  - DataFlow
- Is communication "Turing-complete"?
  - Can simulate each of these on top of the other!
- Many tradeoffs in hardware support
- Communication is a first-class citizen!
  - How to perform communication is essential: IS IT IMPLICIT or EXPLICIT?
  - What to do with communication errors?
  - Does locality matter???
  - How to synchronize?
