COMP 322: Principles of Parallel Programming. Lecture 18: Understanding Parallel Computers (Chapter 2, contd) Fall 2009


1 COMP 322: Principles of Parallel Programming
Lecture 18: Understanding Parallel Computers (Chapter 2, contd)
Fall 2009
http://www.cs.rice.edu/~vsarkar/comp322
Vivek Sarkar, Department of Computer Science, Rice University
COMP 322, Lecture 18, October 2009

2 Acknowledgments for today's lecture
- Course text: Principles of Parallel Programming, Calvin Lin & Lawrence Snyder, including resources available at the publisher's companion website
- Parallel Architectures, Calvin Lin, Lectures 5 & 6, CS380P, Spring 2009, UT Austin
- A Gentler, Kinder Guide to the Multi-core Galaxy, ECE 4100/6100 guest lecture by Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering, Georgia Tech
- Parallel Systems: Introduction, Fall 2008, Jan Lemeire, Vrije Universiteit Brussel

3 A Look at Six Parallel Computers (Chapter 2)
- Chip Multiprocessors: Intel Core Duo (previous lecture), AMD Dual Core Opteron (previous lecture)
- Symmetric Multiprocessors: Sun Fire E25K (this lecture)
- Heterogeneous Processors: Cell processor (this lecture), GPUs (this lecture; only brief mention in Chapter 2)
- Clusters
- Supercomputers: Blue Gene/L (this lecture)

4 Sun Fire E25K
- Up to 72 processors, each of which can execute 2 hardware threads and directly access 16GB memory
- Total 1.15TB memory (72 x 16GB) accessed via a directory-based cache coherence protocol
- Each board contains four processors w/ a snooping bus
- Boards are connected by three 18x18 crossbar switches
- Crossbars have high bisection bandwidth, but the switch cost grows as n^2 for n boards

5 Demystifying Crossbar Switches
Figure 2.5: Crossbar switch connecting four nodes. Notice the output and input channels; crossing wires do not connect unless a connection is shown. Each pair of nodes is directly connected by setting one of the open circles.

6 Flynn's Taxonomy

                  Single Instruction   Multiple Instructions
  Single Data           SISD                  MISD
  Multiple Data         SIMD                  MIMD

7 Architecture of the Cell Processor (Figure 2.6)

8 Cell Features
- Heterogeneous multicore system architecture: Power Processor Element (PPE) for control tasks, Synergistic Processor Elements (SPEs) for data-intensive processing
- A Synergistic Processor Element (SPE) consists of: a Synergistic Processor Unit (SPU), Synergistic Memory Flow Control (MFC) for data movement and synchronization, and an interface to the high-performance Element Interconnect Bus (EIB)
[Block diagram: 8 SPEs (each an SPU/SXU with Local Store and MFC) attached at 16B/cycle to the EIB (up to 96B/cycle); PPE (PXU, L1, L2) with 32B/cycle L2 and 16B/cycle L1 links; 64-bit Power Architecture with VMX; MIC to dual XDR at 16B/cycle; BIC to FlexIO at 16B/cycle (x2)]

9 Synergistic Processor Element (SPE)
- ISA influenced by VMX and the PS2's Emotion Engine
- User-mode architecture: no translation/protection within the SPE; DMA uses full PowerPC protection/translation
- Direct programmer control: DMA/DMA-list, branch hint
- No dynamic branch prediction; in-order execution
- VMX-like SIMD dataflow: graphics SP-float (no saturating arithmetic, some byte operations), IEEE DP-float (BlueGene-like)
- Unified register file: 128 entries x 128 bits
- 256KB Local Store: combined instructions & data, 16B/cycle load/store bandwidth, 128B/cycle DMA bandwidth
- DMA unit driven by Memory Flow Control (MFC) commands; the MFC's MMU allows a consistent interface to the system storage map for all processors despite the heterogeneous structure

SPU units (pipelined):
- Simple (FXU even): add/compare, rotate, logical, count leading zeros
- Permute (FXU odd): permute, table-lookup
- FPU (single / double precision)
- Control (SCN): dual issue, load/store, ECC handling
- Channel (SSC): interface to MFC
- Register file (GPR/FWD)

SPU latencies:
- Simple fixed point: 2 cycles*
- Complex fixed point: 4 cycles*
- Load: 6 cycles*
- Single-precision (ER) float: 6 cycles*
- Integer multiply: 7 cycles*
- Branch miss (no penalty for a correct hint): 20 cycles
- DP (IEEE) float (partially pipelined): 13 cycles*
- Enqueue DMA command: 20 cycles*
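The 128-bit SIMD dataflow above can be made concrete with a short sketch. This assumes the Cell SDK's spu_intrinsics.h; the saxpy4 helper and its arguments are illustrative, not taken from the slides.

  #include <spu_intrinsics.h>

  /* Minimal sketch (assuming the Cell SDK's spu_intrinsics.h) of the
   * SIMD dataflow described above: each 128-bit SPU register holds
   * four 32-bit floats, so one spu_madd performs four multiply-adds. */
  void saxpy4(vec_float4 *y, const vec_float4 *x, float a, int nvec)
  {
      /* Replicate the scalar a into all four lanes of a SIMD register. */
      vec_float4 va = spu_splats(a);
      for (int i = 0; i < nvec; i++)
          y[i] = spu_madd(va, x[i], y[i]);  /* y = a*x + y, 4 lanes at once */
  }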

10 Memory Flow Controller Commands

DMA Commands:
- Put: transfer from Local Store to EA space
- Puts: transfer and start SPU execution
- Putr: Put Result (Arch.: scarf into L2)
- Putl: Put using DMA List in Local Store
- Putrl: Put Result using DMA List in LS (Arch.)
- Get: transfer from EA space to Local Store
- Gets: transfer and start SPU execution
- Getl: Get using DMA List in Local Store
- Sndsig: send signal to SPU

Command modifiers <f,b>:
- f: embedded tag-specific fence; the command will not start until all previous commands in the same tag group have completed
- b: embedded tag-specific barrier; the command and all subsequent commands in the same tag group will not start until previous commands in the same tag group have completed

SL1 Cache Management Commands:
- sdcrt: data cache region touch (DMA Get hint)
- sdcrtst: data cache region touch for store (DMA Put hint)
- sdcrz: data cache region zero
- sdcrs: data cache region store
- sdcrf: data cache region flush

Command parameters:
- LSA: Local Store Address (32 bit)
- EA: Effective Address (32 or 64 bit)
- TS: Transfer Size (16 bytes to 16KB)
- LS: DMA List Size (8 bytes to 16KB)
- TG: Tag Group (5 bit)
- CL: Cache Management / Bandwidth Class

Synchronization Commands:
- Lockline (atomic update) commands: getllar (DMA 128 bytes from EA to LS and set reservation), putllc (conditionally DMA 128 bytes from LS to EA), putlluc (unconditionally DMA 128 bytes from LS to EA)
- barrier: all previous commands complete before subsequent commands are started
- mfcsync: results of all previous commands in the tag group are remotely visible
- mfceieio: results of all preceding Put commands in the same group are visible with respect to succeeding Get commands
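From SPU code, these MFC commands are issued through intrinsics. A minimal sketch, assuming the Cell SDK's spu_mfcio.h; the buffer, effective address, and TAG value are illustrative, not from the slides.

  #include <spu_mfcio.h>

  /* Minimal sketch of enqueueing a Get and waiting on its tag group.
   * TAG is a 5-bit tag group number (0..31), as described above. */
  #define TAG 3

  volatile char buf[16384] __attribute__((aligned(128)));  /* LS buffer */

  void fetch(unsigned long long ea, unsigned int size)
  {
      mfc_get(buf, ea, size, TAG, 0, 0);   /* enqueue DMA: EA -> LS */
      mfc_write_tag_mask(1 << TAG);        /* select tag group TAG */
      mfc_read_tag_status_all();           /* block until the group completes */
  }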

11 Power Processor Element (PPE)
- General-purpose, 64-bit RISC processor (Power/PowerPC binary compatible)
- In-order, dual issue, dual threaded
- L1: 32KB I; 32KB D
- L2: 512KB
- Coherent load/store
- VMX-32
- Realtime controls: locking L2 cache & TLB, software/hardware managed TLB, bandwidth/resource reservation, mediated interrupts
[Figure: PPE structure]

12 Element Interconnect Bus
- EIB data ring for internal communication
- Four unidirectional 16-byte data rings, supporting multiple simultaneous transfers: 2 clockwise, 2 anti-clockwise; worst-case latency is half the ring length
- 96B/cycle peak bandwidth
- Over 100 outstanding requests

13 Example of Eight Concurrent Transactions
[Figure: the EIB's data arbiter and per-ramp controllers connect twelve ramps (PPE, SPE0-SPE7, MIC, BIF/IOIF0, IOIF1) to the four data rings (Ring0-Ring3), allowing eight transactions to proceed concurrently]

14 Theoretical Peak Operations
[Chart: theoretical peak operations per second (up to ~250 billion ops/sec) for FP (SP), FP (DP), Int (16 bit), and Int (32 bit), comparing the Freescale MPC8641D (1.5 GHz), AMD Athlon 64 X2 (2.4 GHz), Intel Pentium D (3.2 GHz), PowerPC 970MP (2.5 GHz), and Cell Broadband Engine (3.2 GHz)]

15 CELL Software Design Considerations
- Four levels of parallelism: Blade level (two Cell processors per blade), Chip level (9 cores run independent tasks), Instruction level (dual-issue pipelines on each SPE), Register level (native SIMD on SPE and PPE VMX)
- 256KB local store per SPE holds data + code + stack
- Communication: DMA and bus bandwidth, DMA granularity of 128 bytes, DMA bandwidth between LS and system memory, traffic control
- Exploit computational complexity and data locality to lower the data traffic requirement
- Shared memory / message passing abstraction overhead
- Synchronization
- DMA latency handling

16 Typical CELL Software Development Flow
1. Algorithm complexity study
2. Data layout/locality and data flow analysis
3. Experimental partitioning and mapping of the algorithm and program structure to the architecture
4. Develop PPE control and PPE scalar code
5. Develop PPE control and partitioned SPE scalar code: communication, synchronization, latency handling
6. Transform SPE scalar code to SPE SIMD code
7. Re-balance the computation / data movement
8. Other optimization considerations: PPE SIMD, system bottlenecks, load balance

17 Programming the Cell is Challenging
Issues:
- Dividing the program among the different cores
- Creating instructions in a different language for the 8 SPEs than for the PowerPC core
- Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
- The SPU local store needs coherent DMA access to reach system memory

18 Shared Memory Processor
- The CBE can be explicitly programmed as a shared-memory multiprocessor using two different instruction sets
- The SPEs and the PPE can be programmed to fully inter-operate in a cache-coherent Shared-Memory Multiprocessor Model
- Cache-coherent DMA operations for SPEs: DMA operations use effective addresses common to the PPE and all SPEs
- An SPE shared-memory store instruction is replaced by: a store from the register file to the LS, then a DMA operation from LS to shared memory
- An SPE shared-memory load instruction is replaced by: a DMA operation from shared memory to LS, then a load from LS to the register file (see the sketch below)
- Of course, a compiler could provide much of this functionality
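The load replacement described above looks roughly like the following. A minimal sketch, assuming the Cell SDK's spu_mfcio.h; shared_load, ls_slot, and the tag value are illustrative, and the effective address is assumed to be suitably aligned for a small DMA.

  #include <spu_mfcio.h>

  /* An SPE "shared-memory load" becomes a DMA from the effective
   * address into the LS, followed by an ordinary LS load. */
  static volatile double ls_slot __attribute__((aligned(16)));

  double shared_load(unsigned long long ea)  /* ea: effective address */
  {
      mfc_get(&ls_slot, ea, sizeof(double), 31, 0, 0); /* shared mem -> LS */
      mfc_write_tag_mask(1u << 31);
      mfc_read_tag_status_all();              /* wait for the DMA */
      return ls_slot;                         /* LS -> register file */
  }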

19 Compiling a single source file for the Cell (w/o buffers)

Single source:
  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

Outlined PPE code:
  foo3(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

Outlined SPE code:
  foo3_spu(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

Main program:
  foo1();
  runtime distribution of work: invoke foo3 and foo3_spu for i=[0,N)
  runtime barrier
  foo2();

In SPE code: A, B, and x are shared

20 Compiling a single source file for the Cell (w/ buffers)

Single source:
  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

Outlined PPE code:
  foo3(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

Outlined SPE code (with local-store buffers A'[M], B'[M]):
  foo3_spu(LB,UB):
    for (k=LB; k < UB; k+=M) {
      DMA M elements of B into B'
      for (j=0; j<M; j++)
        A'[j] = cache_lookup(x) * B'[j];
      DMA M elements out of A' into A
    }
    runtime barrier

Main program:
  foo1();
  runtime distribution of work: invoke foo3 and foo3_spu for i=[0,N)
  runtime barrier
  foo2();

21 Data Partitioning
- Single Source assumption: all data lives in system memory
- In a naïve implementation, every load and store requires a DMA operation
- Too costly (~700 cycles per load or store)
- MP will require locking on every reference
- What can be done to make this acceptable?

22 Example: Prefetching

Original code:
  for(i=0;i<100000;i++)
    a[i]=b[i]+c[i];

Blocked, with prefetch:
  for(i=0;i<100000;i+=100) {
    dma_get(b',b[i],400);
    dma_get(c',c[i],400);
    for(ii=0;ii<100;ii++)
      a'[ii]=b'[ii]+c'[ii];
    dma_put(a[i],a',400);
  }

Software-pipelined prefetch:
  dma_get(b',b[0],400);
  dma_get(c',c[0],400);
  for(i=0;i<99900;i+=100) {
    dma_get(b'',b[i+100],400);
    dma_get(c'',c[i+100],400);
    for(ii=0;ii<100;ii++)
      a'[ii]=b'[ii]+c'[ii];
    dma_put(a[i],a',400);
    swap(a',a''); swap(b',b''); swap(c',c'');
  }
  for(ii=0;ii<100;ii++)
    a'[ii]=b'[ii]+c'[ii];
  dma_put(a[i],a',400);
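The dma_get/dma_put calls above are slide pseudocode. The following is a self-contained plain-C sketch of the same double-buffered pattern, with memcpy standing in for the asynchronous DMA transfers; on a real SPE the copies would be MFC transfers overlapped with the compute loop. N, BLK, and blocked_add are illustrative.

  #include <string.h>

  #define N   100000
  #define BLK 100       /* elements per block; sizes illustrative */

  /* Double-buffered blocked loop: prefetch the next block while
   * computing on the current one, then flip the buffer index. */
  void blocked_add(float *a, const float *b, const float *c)
  {
      float bbuf[2][BLK], cbuf[2][BLK], abuf[BLK];
      int cur = 0;

      memcpy(bbuf[cur], &b[0], sizeof bbuf[cur]);  /* prime the pipeline */
      memcpy(cbuf[cur], &c[0], sizeof cbuf[cur]);
      for (int i = 0; i < N; i += BLK) {
          int nxt = cur ^ 1;
          if (i + BLK < N) {                        /* prefetch next block */
              memcpy(bbuf[nxt], &b[i + BLK], sizeof bbuf[nxt]);
              memcpy(cbuf[nxt], &c[i + BLK], sizeof cbuf[nxt]);
          }
          for (int ii = 0; ii < BLK; ii++)          /* compute current block */
              abuf[ii] = bbuf[cur][ii] + cbuf[cur][ii];
          memcpy(&a[i], abuf, sizeof abuf);         /* write block back */
          cur = nxt;
      }
  }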

23 Irregular Accesses
What do we do about this?
  for(i=0;i<100000;i++)
    a[i]=b[i]+c[i]*d[f(i)];
- b and c can be prefetched, but d has an irregular access pattern, so we cannot predict which elements of d are required
- We seem to be thrown back on the naïve implementation: d[f(i)] must be fetched on each iteration, with a consequent large slowdown of the loop
- Observation: it's as if every access to d incurred a cache miss

24 Software Caching

Original code:
  for(i=0;i<100000;i++)
    ... = d[f(i)];

Code with explicit cache lookup:
  for(i=0;i<100000;i++) {
    t = cache_lookup(d[f(i)]);
    ... = t;
  }

  inline vector cache_lookup(addr) {
    if (cache_directory[addr&key_mask] != (addr&tag_mask))
      miss_handler(addr);
    return cache_data[addr&key_mask][addr&offset_mask];
  }

- The miss handler will DMA the required data, plus some suitable quantity of surrounding data
- Higher degrees of associativity can be supported, possibly for little extra cost on a SIMD processor
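Here is a self-contained, direct-mapped variant of the lookup above in plain C. The line size, table sizes, and fetch_line (which stands in for the miss handler's DMA) are illustrative assumptions, not from the slides.

  #include <stdint.h>

  #define LINE   128                    /* bytes per cache line (illustrative) */
  #define NLINES 64                     /* direct-mapped: 64 lines = 8KB of LS */

  static uintptr_t dir[NLINES];         /* tag of the line held in each slot */
  static uint8_t   data[NLINES][LINE];  /* cached copies of remote lines */

  extern void fetch_line(void *dst, uintptr_t line_addr, int nbytes);

  /* Direct-mapped software cache lookup, in the spirit of the slide's
   * cache_lookup(): index and tag are derived from the address by masking. */
  uint8_t *cache_lookup(uintptr_t addr)
  {
      uintptr_t line = addr & ~(uintptr_t)(LINE - 1);  /* line base address */
      unsigned  idx  = (line / LINE) % NLINES;         /* direct-mapped index */
      if (dir[idx] != line) {                          /* miss: refill slot */
          fetch_line(data[idx], line, LINE);
          dir[idx] = line;
      }
      return &data[idx][addr & (LINE - 1)];            /* pointer into line */
  }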

25 Combining Prefetch with Software Cache

Original code:
  for(i=0;i<100000;i++)
    a[i]=b[i]+c[i]*b[f(i)];

Prefetching and caching:
  for(i=0;i<100000;i+=100) {
    dma_get(b',b[i],400);
    dma_get(c',c[i],400);
    for(ii=0;ii<100;ii++) {
      t = cache_lookup(b[f(i)]);
      a'[ii] = b'[ii] + c'[ii]*t;
    }
    dma_put(a[i],a',400);
  }

- Prefetching must also update the cache directory, and
- Miss handling must not evict prefetched data

26 Coherence Problem
- An SPE accesses data in global memory through two mechanisms: the software-controlled cache and static buffers
- An incorrect value may be used or generated if coherence is not maintained. Examples:
  - Two copies of the data, one in the software-controlled cache and one in a static buffer: one changes the value and the other may read a stale value
  - Multiple copies of the data in different static buffers
- Approaches: compiler (no runtime overhead); runtime (more powerful but complicated)

27 Solution Overview
- Combine the two approaches for an optimal solution: apply the compiler solution as much as possible, and resort to the runtime solution if necessary
- Components: local coherence simplification, global coherence avoidance analysis, dynamic coherence maintenance

28 Local Coherence Simplification
- References are put into a static buffer in a loop only when there is no data dependence between the reference and any other reference accessed through the software-controlled cache or another static buffer in the loop
- Runtime coherence maintenance is then needed only:
  - At loop entry: DMA read, and check whether the software-controlled cache has updated data
  - At loop exit: write-through (update the hit cache line and DMA write) or write-back (put the static buffer content into the cache)
- Pros: requires only local data dependence info, which is more likely to be available; the structure of the software-controlled cache remains unchanged; the coherence maintenance can be overlapped with DMA operations
- Cons: candidates for static buffers may be lost if the data dependence information is too conservative

29 Global Coherence Avoidance Analysis
- Runtime coherence maintenance can be avoided by compiler analysis:
  - At entry: if there is no updated cache line for this static buffer
  - At exit: if there is no cache line for this static buffer already in the cache that will be referenced later
- How the compiler predicts cache contents:
  - No lines are in the cache after a flush
  - If data is carefully aligned or padded, the compiler can assume different variables will never share a cache line
  - Replacement cannot be predicted: a line is assumed to stay in the cache until the next flush

30 Optimization with Flushes
- When runtime coherence maintenance is required by the previous analysis, it may be profitable to insert extra cache flushes to avoid the coherence maintenance
- A flush can target one variable, or several can be combined into a flush-all
- The previous analysis provides information about the possible insertion points for flushes
- Move flushes in the control flow graph to reduce the overhead, similar to the algorithm for partial redundancy elimination; branch profiling may help

31 Why GPUs?
Two major trends:
1. Increasing performance gap relative to mainstream CPUs: calculation 367 GFLOPS vs. 32 GFLOPS; memory bandwidth 86.4 GB/s vs. 8.4 GB/s
2. Availability of more general (non-graphics) programming interfaces
A GPU is in every PC and workstation: massive volume and potential impact

32 What is GPGPU?
- General Purpose computation using a GPU in applications other than 3D graphics; the GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput; fine-grain SIMD parallelism; low-latency floating point (FP) computation
- Applications (see GPGPU.org): game effects (FX) physics, image processing; physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

33 Traditional vs. General-Purpose GPUs
- Traditional graphics pipeline (Figure 10.3)
- General-purpose GPU (Figure 10.4(b))

34 Nvidia GeForce 8800 GTX (a.k.a. G80)
- The device is a set of 16 multiprocessors
- Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture (shared instruction unit)
- Each multiprocessor has: a set of 32-bit registers per processor, 16KB on-chip shared memory, a read-only constant cache, and a read-only texture cache
[Figure: device containing Multiprocessor 1..N; each multiprocessor contains Processor 1..M with registers, an instruction unit, shared memory, constant cache, and texture cache, above device memory]

35 Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks; all threads in a grid share the same data memory space
- A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution (for hazard-free shared memory accesses) and efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure, courtesy NVIDIA: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2D array of blocks, e.g. Block (1,1), and each block is an array of threads, Thread (0,0) through Thread (4,2)]
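The grid/block hierarchy maps directly to CUDA C. A minimal sketch; vecAdd, launch, and the block size of 256 are illustrative, not from the slides.

  #include <cuda_runtime.h>

  /* Each thread computes one element; its global index is derived
   * from its block and thread coordinates. */
  __global__ void vecAdd(const float *a, const float *b, float *c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                  /* guard: the last block may be partial */
          c[i] = a[i] + b[i];
  }

  /* Launch: a 1D grid of 1D blocks, 256 threads per block. */
  void launch(const float *a, const float *b, float *c, int n)
  {
      int threads = 256;
      int blocks  = (n + threads - 1) / threads;
      vecAdd<<<blocks, threads>>>(a, b, c, n);
  }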

36 Device Memory Space Overview
- Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read per-grid constant memory; read per-grid texture memory
- The host can read/write global, constant, and texture memory
- These memory spaces are persistent across kernels called by the same application
[Figure: device grid with Block (0,0) and Block (1,0), each with shared memory and per-thread registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the host]

37 CUDA Host-Device Data Transfer
  cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
- Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
- The memory areas may not overlap
- Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in undefined behavior
- Synchronous in CUDA; will it become asynchronous in a future CUDA version?
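A minimal sketch of a host-to-device round trip using the calls above; the buffer size is illustrative and error handling is reduced to a single check for brevity.

  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(void)
  {
      const size_t count = 1024 * sizeof(float);
      float h_buf[1024] = {0};
      float *d_buf;

      cudaMalloc((void **)&d_buf, count);                       /* device alloc */
      cudaMemcpy(d_buf, h_buf, count, cudaMemcpyHostToDevice);  /* host -> device */
      /* ... launch kernels that read/write d_buf here ... */
      cudaMemcpy(h_buf, d_buf, count, cudaMemcpyDeviceToHost);  /* device -> host */
      cudaFree(d_buf);

      if (cudaGetLastError() != cudaSuccess)
          fprintf(stderr, "CUDA error\n");
      return 0;
  }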

38 Access Times
- Register: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached; 1s-10s-100s of cycles, depending on cache locality
- Texture memory: DRAM, cached; 1s-10s-100s of cycles, depending on cache locality
- Instruction memory (invisible): DRAM, cached
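Because shared memory is single-cycle while global memory is slow, kernels typically stage reused data into __shared__ storage. A hedged sketch of this staging pattern; the stencil kernel, TILE size, and halo handling are illustrative, and the kernel assumes blockDim.x == TILE.

  #include <cuda_runtime.h>

  #define TILE 256

  /* Values reused by several threads are loaded once from slow global
   * memory into single-cycle shared memory. Computes c[i] = a[i] + a[i+1]. */
  __global__ void stencil(const float *a, float *c, int n)
  {
      __shared__ float s[TILE + 1];            /* tile plus one halo element */
      int i = blockIdx.x * TILE + threadIdx.x;

      if (i < n)
          s[threadIdx.x] = a[i];               /* one global read per thread */
      if (threadIdx.x == 0 && i + TILE < n)
          s[TILE] = a[i + TILE];               /* halo element at the tile edge */
      __syncthreads();                         /* make loads visible block-wide */

      if (i + 1 < n)
          c[i] = s[threadIdx.x] + s[threadIdx.x + 1];  /* shared-memory reads */
  }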

39 Blue Gene/L
The full machine has 65,536 dual-core nodes; each node has:
- 32KB L1 instruction & data caches
- 4MB on-chip L3 cache
- 2.8 GFLOPS computation capacity with a 700MHz clock
- 6 bidirectional ports to a 3-D torus interconnect
- 3 bidirectional ports to a collective network
- 4 ports to a barrier/interrupt network
Figure 2.7: Logical organization of a BlueGene/L node.

40 Figure 2.8: BlueGene/L communication networks; (a) 3D 64x32x32 torus for standard interprocessor data transfer; (b) collective network for fast evaluation of reductions. Worst-case latency in the torus = 32 + 16 + 16 = 64 hops (half of each dimension).

41 Candidate Type Architecture (CTA)
- Differentiate between local and remote memory with respect to a processor
- Define λ = latency to access non-local memory, relative to a local memory access
- Locality rule: fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references
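A quick worked example makes the locality rule concrete (the numbers here are illustrative, not taken from Table 2.1). If a fraction r of references are non-local, then, in units of local access time,

  average cost per reference = (1 - r) * 1 + r * λ

With λ = 100 and r = 0.02, this gives 0.98 + 2.0 = 2.98: just 2% non-local references already make the average memory access nearly 3x slower than all-local execution.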

42 Table 2.1: Estimates for λ for common architectures; speeds generally do not include congestion or other traffic delays.

43 The Memory Wall

44 Conclusions
- Great diversity in parallel architectures: a multi-dimensional space of memory model, communication latency, communication bandwidth, processing power, and number of processors
- Issues of scale add to the diversity
- Many ways of balancing these characteristics
- Impact on programmers: this diversity complicates the task of programmers who care about portability and scalability

45 Announcements
- No class on Tuesday, Nov 3rd; we will meet next on Thursday, Nov 5th
- Homework #2, due on Thursday, Nov 5th: Problem 8, Chapter 2, page 60
  A single processor is a 0-cube; two connected processors are a 1-cube; given two n-cubes, connecting corresponding elements produces an (n+1)-cube. In an n-cube, what is the maximum length of the path required to connect two arbitrary nodes? Explain why.
