Simulation of OpenCL and APUs on Multi2Sim 4.1


1 Simulation of OpenCL and APUs on Multi2Sim 4.1 Rafael Ubal, David Kaeli

2 Outline
Introduction
Simulation methodology
Part 1: Simulation of an x86 CPU — Disassembler; Emulation; Timing simulation; Memory hierarchy; Visualization tool
Part 2: Simulation of a Southern Islands GPU — OpenCL from host to device; Disassembler; Emulation; Timing simulation; Visualization tool; Case study: ND-Range virtualization
Part 3: Concluding remarks — Additional projects; The Multi2Sim community

3 Introduction Getting Started Follow our demos! User accounts for demos Machine: fusion1.ece.neu.edu User: isca1, isca2, isca3,... Password: isca2013 3

4 Introduction Getting Started
Connect to our server:
$ ssh isca<n>@fusion1.ece.neu.edu -X
(Notice the X forwarding for later demos using graphics.)
Demo descriptions:
$ ls
demo1 demo2 demo3 demo4 demo5 demo6 demo7 README
All files needed for each demo are present in its corresponding directory. README files describe the commands to run and how to interpret their outputs.
Download and compile Multi2Sim:
$ wget <multi2sim-4.1.tar.gz download URL>
$ tar -xzf multi2sim-4.1.tar.gz
$ cd multi2sim-4.1
$ ./configure && make

5 Introduction First Execution
Source code:
#include <stdio.h>
int main(int argc, char **argv)
{
        int i;
        printf("Number of arguments: %d\n", argc);
        for (i = 0; i < argc; i++)
                printf("\targv[%d] = %s\n", i, argv[i]);
        return 0;
}
Native execution:
$ test-args hello there
Number of arguments: 3
        argv[0] = 'test-args'
        argv[1] = 'hello'
        argv[2] = 'there'
Execution on Multi2Sim:
$ m2s test-args hello there
< Simulator message in stderr >
Number of arguments: 3
        argv[0] = 'test-args'
        argv[1] = 'hello'
        argv[2] = 'there'
< Simulator statistics >
Demo 1

6 Introduction Simulator Input/Output Files
Example of INI file format:
; This is a comment.
[ Section 0 ]
Color = Red
Height = 40
[ OtherSection ]
Variable = Value
Multi2Sim uses the INI file format for configuration files and for output statistics. A statistics summary is printed to the standard error output.
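As a sketch of how standard tooling handles this format (illustrative only; Multi2Sim implements its own INI parser in C), Python's configparser can read the example above directly:

```python
# Illustrative: parse the slide's example INI file with Python's
# standard configparser. Multi2Sim's own parser is written in C.
import configparser

ini_text = """
; This is a comment.
[ Section 0 ]
Color = Red
Height = 40
[ OtherSection ]
Variable = Value
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

def lookup(section, key):
    # Section names keep surrounding spaces in configparser,
    # so match them after stripping; keys are case-insensitive.
    for name in config.sections():
        if name.strip() == section:
            return config[name][key]
    raise KeyError(section)
```

Note that values are returned as strings; numeric options such as Height must be converted explicitly.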

7 Simulation Methodology Application-Only vs. Full-System
Full-system simulation: an entire OS runs on top of the simulator, with guest programs running on top of that OS. The simulator models the complete processor ISA and virtualizes native hardware devices (I/O hardware), similar to a virtual machine. Very accurate simulations, but extremely slow.
Application-only simulation: only an application runs on top of the simulator. The simulator implements the user-space subset of the ISA, and needs to virtualize the system call interface (ABI). Multi2Sim falls in this category.

8 Simulation Methodology Four-Stage Simulation Process
Disassembler: takes instruction bytes from an executable ELF file and produces instruction fields, or a dump of the disassembled instructions.
Emulator (or functional simulator): takes the executable file and program arguments, runs one instruction at a time, and produces the program output plus instruction information.
Timing simulator (or detailed/architectural simulator): additionally takes a processor configuration, and produces performance statistics and a pipeline trace.
Visual tool: takes the pipeline trace and offers user interaction: cycle navigation and timing diagrams.
Modular implementation: four clearly different software modules per architecture (x86, MIPS, ...). Each module has a standard interface for stand-alone execution, or for interaction with other modules.

9 Simulation Methodology Current Architecture Support

                       Disasm.      Emulation    Timing sim.  Graphic pipelines
ARM                    X            In progress
MIPS                   X            X
x86                    X            X            X            X
AMD Evergreen          X            X            X            X
AMD Southern Islands   X            X            X            X
NVIDIA Fermi           X            In progress
NVIDIA Kepler          In progress

In our latest Multi2Sim SVN repository, 4 GPU + 3 CPU architectures are supported or in progress. This tutorial will focus on x86 and AMD Southern Islands.

10 Part 1 Simulation of an x86 CPU 10

11 The x86 Disassembler Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 11

12 The x86 Disassembler Methodology
Implementation of an efficient instruction decoder based on lookup tables. When used as a stand-alone tool, the output is provided in exactly the same format as the GNU x86 disassembler, for automatic verification.
GNU x86 disassembler:
$ objdump -S -M intel test-args
Multi2Sim x86 disassembler:
$ m2s --x86-disasm test-args
Verification of common output (<_start>; addresses and instruction bytes omitted):
xor    ebp, ebp
pop    esi
mov    ecx, esp
and    esp, 0xfffffff0
push   eax
push   esp
push   edx
...

13 The x86 Emulator Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 13

14 The x86 Emulator Program Loading
1) Parse the ELF executable: read ELF sections and symbols; initialize code and data. The initial virtual memory image contains, from top to bottom: the stack (program arguments, environment variables), the mmap region, the heap, initialized data, and the text section (around 0x08xxxxxx).
2) Initialize the stack: program headers, arguments, environment variables.
3) Initialize registers: the program entry point goes into eip, and the stack pointer into esp; the remaining registers (eax, ebx, ecx, ...) are not initialized.

15 The x86 Emulator Emulation Loop
Emulation of x86 instructions: read the instruction at eip, decode it, and, if it is not int 0x80, emulate it, updating x86 registers and the memory map if needed (example: add [bp+16], 0x5).
Emulation of Linux system calls: if the instruction is int 0x80, analyze the system call code and arguments, emulate the system call, update the memory map, and update register eax with the return value (example: read(fd, buf, count)).
In both cases, eip then moves to the next instruction.
Demo 2
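The loop above can be sketched with a toy ISA. Everything here (the opcodes ADDI/SYSCALL/HALT, the 4-register file) is invented for illustration; Multi2Sim's real x86 emulator is written in C and far more elaborate:

```python
# Toy fetch-decode-execute loop mirroring the emulation scheme above.
# The ISA and register file are invented; reg 0 plays the role of eax.
ADDI, SYSCALL, HALT = 1, 2, 0

def emulate(mem):
    """Run the toy program in 'mem'; return (regs, eip, syscall_log)."""
    regs = [0, 0, 0, 0]
    eip = 0
    log = []
    while True:
        op = mem[eip]                       # fetch instruction at eip
        if op == ADDI:                      # ADDI rd, imm
            regs[mem[eip + 1]] += mem[eip + 2]
            eip += 3                        # move eip to next instr.
        elif op == SYSCALL:                 # int 0x80 analogue
            log.append(regs[0])             # "analyze" the call code
            regs[0] = 0                     # return value into reg 0
            eip += 1
        else:                               # HALT: stop emulation
            return regs, eip, log

# Program: reg0 += 5; reg1 += 7; syscall; halt
program = [ADDI, 0, 5, ADDI, 1, 7, SYSCALL, HALT]
```

The system-call branch is where a real emulator virtualizes the ABI: it inspects the call code in eax, performs the equivalent host-side action, and writes the return value back.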

16 The x86 Timing Simulator Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 16

17 The x86 Timing Simulator Superscalar Processor
Pipeline diagram (stages: Fetch, Decode, Dispatch, Issue, Writeback, Commit; structures: fetch queue, instruction cache, uop queue, trace cache, trace queue, reorder buffer, instruction queue, load/store queue, ALUs, data cache, register file).
Superscalar x86 pipelines: 6-stage pipeline with configurable latencies. Supported features include speculative execution, branch prediction, microinstruction generation, trace caches, and out-of-order execution. Modeled structures include fetch queues, reorder buffer, load/store queues, register files, register mapping tables, ...

18 The x86 Timing Simulator Multithreaded and Multicore Processors
Multithreading: replicated superscalar pipelines with partially shared resources. Shared resources: reorder buffer, instruction queue, load/store queue, register file, functional units. Private resources per hardware thread: program counter, register aliasing table, TLB. Fine-grain, coarse-grain, and simultaneous multithreading are supported.
Multicore: fully replicated superscalar pipelines, connected through the memory hierarchy. With n cores of m threads each, core 0 holds nodes 0 to m-1, core 1 holds nodes m to 2m-1, and so on up to nodes (n-1)m to nm-1. Runs multiple programs concurrently, or one program spawning child threads (using OpenMP, pthread, etc.).
Demo 3

19 The x86 Timing Simulator Benchmark Support Single-threaded applications SPEC 2000 and SPEC 2006 benchmarks are fully supported. Pre-compiled x86 binaries are available on the website. The Mediabench suite includes program binaries and data files, with all you need to run them. Multithreaded applications SPLASH-2 benchmark suite with pre-compiled x86 executables and data files available on the website. PARSEC-2.1 with pre-compiled x86 executables and data files. 19

20 The Memory Hierarchy Configuration Flexible hierarchies Any number of caches organized in any number of levels. Cache levels connected through default cross-bar interconnects, or complex custom interconnect configurations. Each architecture undergoing a timing simulation specifies its own entry point (cache memory) in the memory hierarchy, for data or instructions. Cache coherence is guaranteed with an implementation of the 5-state MOESI protocol. 20

21 The Memory Hierarchy Configuration Examples
Example 1: three CPU cores with private L1 caches, two L2 caches, and default cross-bar based interconnects. Cache L2-0 serves the physical address range [0, 7ff...ff], and cache L2-1 serves [800...0, ff...ff].
Topology: Core 0/1/2 → L1-0/L1-1/L1-2 → switch → L2-0, L2-1 → switch → Main Memory.
Demo 4
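Hierarchies like this one are declared in an INI file passed to Multi2Sim with --mem-config. The fragment below only sketches the flavor of that file, for a single core with one L1 and a main-memory module; section and variable names are abridged from memory of the Multi2Sim Guide, so consult the Guide for the authoritative, complete set of options:

```ini
; Sketch of a --mem-config file (abridged and unverified; see the
; Multi2Sim Guide for the exact sections and variables).

[CacheGeometry geo-l1]
Sets = 64
Assoc = 2
BlockSize = 64
Latency = 2

[Module mod-l1-0]
Type = Cache
Geometry = geo-l1
LowModules = mod-mm

[Module mod-mm]
Type = MainMemory
BlockSize = 64
Latency = 100

[Entry core-0]
Arch = x86
Core = 0
Thread = 0
DataModule = mod-l1-0
InstModule = mod-l1-0
```

The [Entry ...] sections are what give each architecture its own entry point into the hierarchy, as described above.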

22 The Memory Hierarchy Configuration Examples Core 0 Example 2 Four CPU cores with private L1 data caches, L1 instruction caches and L2 caches shared every 2 cores (serving the whole address space), and four main memory modules, connected with a custom network on a ring topology. Data L1-0 Core 1 Inst. L1-0 Data L1-1 Switch Core 2 Data L1-2 Core 3 Inst. L1-1 Data L1-3 Switch L2-0 L2-1 n0 n1 sw0 sw1 sw2 sw3 n2 n3 n4 n5 MM-0 MM-1 MM-2 MM-3 22

23 The Memory Hierarchy Configuration Examples Example 3 Ring connection between four switches associated with end-nodes with routing tables calculated automatically based on shortest paths. The resulting routing algorithm can contain cycles, potentially leading to routing deadlocks at runtime. n0 s0 s1 n1 n3 s3 s2 n2 23

24 The Memory Hierarchy Configuration Examples
Example 4: ring connection between four switches associated with end-nodes, where a routing cycle has been removed by adding an additional virtual channel.
n0 s0 s1 n1 s2 n2 n3 s3 (Virtual Channel 0, Virtual Channel 1)

25 Pipeline Visualization Tool Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 25

26 Pipeline Visualization Tool Pipeline Diagrams Cycle bar on main window for navigation. Panel on main window shows software contexts mapped to hardware cores. Clicking on the Detail button opens a secondary window with a pipeline diagram. 26

27 Pipeline Visualization Tool Memory Hierarchy Panel on main window shows how memory accesses traverse the memory hierarchy. Clicking on a Detail button opens a secondary window with the cache memory representation. Each row is a set, each column is a way. Each cell shows the tag and state (color) of a cache block. Additional columns show the number of sharers and in-flight accesses. Demo 5 27

28 Part 2 Simulation of a Southern Islands GPU 28

29 OpenCL on the Host Execution Framework
Multi2Sim 4.1 includes a new execution framework for OpenCL, developed in collaboration with the University of Toronto. The new framework is a more accurate analogy to native execution, and is fully AMD-compliant. When working with x86 kernel binaries, the OpenCL runtime can perform both native and simulated execution correctly. When run natively, an OpenCL call to clGetDeviceIDs returns only the x86 device. When run on Multi2Sim, clGetDeviceIDs returns one device per supported architecture: x86, Evergreen, and Southern Islands devices (more to be added).

30 OpenCL on the Host Execution Framework
The following slides show the modular organization of the OpenCL execution framework, based on four software/hardware entities: the user application, which makes API calls into the runtime library (user-level code); the runtime library, which makes ABI calls into the device driver (OS-level code); the device driver, which talks to the hardware through an internal interface; and the hardware itself. In each case, we compare native execution with simulated execution.

31 OpenCL on the Host The OpenCL CPU Host Program Native Multi2Sim An x86 OpenCL host program performs an OpenCL API call. Exact same scenario. 31

32 OpenCL on the Host The OpenCL Runtime Library Native Multi2Sim AMD's OpenCL runtime library handles the call, and communicates with the driver through system calls ioctl, read, write, etc. These are referred to as ABI calls. Multi2Sim's OpenCL runtime library, running as guest code, transparently intercepts the call. It communicates with the Multi2Sim driver using system calls with codes not reserved in Linux.

33 OpenCL on the Host The OpenCL Device Driver Native Multi2Sim The AMD Catalyst driver (kernel module) handles the ABI call and communicates with the GPU through the PCIe bus. An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator. 33

34 OpenCL on the Host The GPU Emulator Native Multi2Sim The command processor in the GPU handles the messages received from the driver. The GPU emulator updates its internal state based on the message received from the driver. 34

35 OpenCL on the Host Transferring Control
Beginning execution on the GPU: the key OpenCL call that effectively triggers GPU execution is clEnqueueNDRangeKernel.
Order of events: the host program performs API call clEnqueueNDRangeKernel. The runtime intercepts the call and enqueues a new task in an OpenCL command queue object. A user-level thread associated with the command queue eventually processes the command, performing a LaunchKernel ABI call. The driver intercepts the ABI call, reads the ND-Range parameters, and launches the GPU emulator. The GPU emulator enters a simulation loop until the ND-Range completes.

36 OpenCL on the Device Execution Model
Work-items execute multiple instances of the same kernel code. Work-groups are sets of work-items that can synchronize and communicate efficiently. The ND-Range is composed of all work-groups, which do not communicate with each other and can execute in any order. Each work-item has private memory, each work-group has local memory, and the whole ND-Range shares global memory.
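This hierarchy fixes the usual OpenCL index arithmetic: a work-item's global ID is its work-group ID times the work-group size plus its local ID. A one-dimensional sketch (function names are ours, not an OpenCL API):

```python
# Illustrative 1-D OpenCL index arithmetic: how a work-item's global ID
# (get_global_id) relates to its work-group ID and local ID.
def global_id(group_id, local_size, local_id):
    return group_id * local_size + local_id

def enumerate_ndrange(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item."""
    assert global_size % local_size == 0   # ND-Range = whole work-groups
    for group in range(global_size // local_size):
        for local in range(local_size):
            yield group, local, global_id(group, local_size, local)
```

Because work-groups may run in any order, only the local IDs within one group have a guaranteed execution relationship.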

37 OpenCL on the Device Execution Model Software-hardware mapping When the kernel is launched by the Southern Islands driver, the OpenCL ND-Range is mapped to the compute device (Fig. a). The work-groups are mapped to the compute units (Fig. b). The work-items are executed by the SIMD lanes (Fig. c). This is a simplified view of the architecture, explored later in more detail. Ultra-Threaded Dispatcher Compute Unit 1 Global Memory a) Compute device. Compute Unit 31 SIMD 3 0 Lane 15 Local Memory b) Compute unit. 15 Wavefront Scheduler Compute Unit 0 SIMD 0 0 Lane Functional Units (integer & float.) Register File c) SIMD lane (stream core). 37

38 The Southern Islands Disassembler Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 38

39 The Southern Islands Disassembler SIMD Execution
Motivation: OpenCL follows a Single-Program Multiple-Data (SPMD) programming model, which maps to a Single-Instruction Multiple-Data (SIMD) execution model. A wavefront is a fixed set of work-items forming the SIMD execution unit: one single instruction is fetched, and multiple instances of it are executed.
If the wavefront is large: ports in the instruction cache, and fetch hardware in general, are amortized.
If the wavefront is small: it is easier to create work-groups whose size is an exact multiple of the wavefront size, reducing waste in the wavefront tail, and the effects of thread divergence in conditional execution or loops are reduced.
Trade-off: 1 wavefront = 64 work-items.
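The tail waste mentioned above is easy to quantify; 64 is the Southern Islands wavefront size, and the helper names are ours:

```python
# Wavefront-count and tail-waste arithmetic for the trade-off above.
WAVEFRONT_SIZE = 64  # fixed on Southern Islands

def wavefronts_per_group(wg_size):
    """Number of wavefronts needed for a work-group of wg_size items."""
    return -(-wg_size // WAVEFRONT_SIZE)   # ceiling division

def tail_waste(wg_size):
    """Idle SIMD lanes in the last (partial) wavefront."""
    return wavefronts_per_group(wg_size) * WAVEFRONT_SIZE - wg_size
```

A work-group of 256 items fills 4 wavefronts exactly, while one of 100 items needs 2 wavefronts and leaves 28 lanes idle in the tail, which is why sizes that are exact multiples of 64 are preferred.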

40 The Southern Islands Disassembler Scalar Instructions
Motivation: sometimes work-items within a wavefront not only execute the same instruction, but also operate on the same data. The compiler can detect some of these cases and emit scalar instructions, issued to a special execution unit of much lower cost.
Scalar opportunities: loading the base address of a buffer; loading values from constant memory; loop initialization, comparison, and increments.

41 The Southern Islands Disassembler Vector Addition Kernel
Source code:
kernel void vector_add(
        read_only global int *src1,
        read_only global int *src2,
        write_only global int *dst)
{
        int id = get_global_id(0);
        dst[id] = src1[id] + src2[id];
}
Assembly code (s_* are scalar instructions on scalar registers; v_* are vector instructions on vector registers):
s_buffer_load_dword s0, s[4:7], 0x04
s_buffer_load_dword s1, s[4:7], 0x18
s_buffer_load_dword s4, s[8:11], 0x00
s_buffer_load_dword s5, s[8:11], 0x04
s_buffer_load_dword s6, s[8:11], 0x08
s_load_dwordx4 s[8:11], s[2:3], 0x58
s_load_dwordx4 s[16:19], s[2:3], 0x60
s_load_dwordx4 s[20:23], s[2:3], 0x50
s_waitcnt lgkmcnt(0)
s_min_u32 s0, s0, 0x0000ffff
v_mov_b32 v1, s0
v_mul_i32_i24 v1, s12, v1
v_add_i32 v0, vcc, v0, v1
v_add_i32 v0, vcc, s1, v0
v_lshlrev_b32 v0, 2, v0
v_add_i32 v1, vcc, s4, v0
v_add_i32 v2, vcc, s5, v0
v_add_i32 v0, vcc, s6, v0
tbuffer_load_format_x v1, v1, s[8:11], 0      ; the loads
tbuffer_load_format_x v2, v2, s[16:19], 0
s_waitcnt vmcnt(0)
v_add_i32 v1, vcc, v1, v2                     ; the addition
tbuffer_store_format_x v1, v0, s[20:23], 0    ; the store
s_endpgm

42 The Southern Islands Disassembler Conditional Statements
Source code:
kernel void if_kernel(global int *v)
{
        uint id = get_global_id(0);
        if (id < 5)
                v[id] = 10;
}
Assembly code:
s_buffer_load_dword s0, s[8:11], 0x04
s_buffer_load_dword s1, s[8:11], 0x18
s_waitcnt lgkmcnt(0)
s_min_u32 s0, s0, 0x0000ffff
v_mov_b32 v1, s0
v_mul_i32_i24 v1, s16, v1
v_add_i32 v0, vcc, v0, v1
v_add_i32 v0, vcc, s1, v0
s_buffer_load_dword s0, s[12:15], 0x00
v_cmp_lt_u32 s[2:3], v0, 5                    ; the comparison
s_and_saveexec_b64 s[2:3], s[2:3]             ; save active mask
v_lshlrev_b32 v0, 2, v0
s_waitcnt lgkmcnt(0)
v_add_i32 v0, vcc, s0, v0
v_mov_b32 v1, 10
tbuffer_store_format_x v1, v0, s[4:7], 0      ; store value 10
s_mov_b64 exec, s[2:3]                        ; restore active mask
s_endpgm
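The v_cmp / s_and_saveexec / exec-restore sequence can be mimicked in plain Python: build a per-lane mask from the comparison, apply the store only to active lanes, then restore the previous mask. The lane count and buffer are illustrative (a real wavefront has 64 lanes):

```python
# Illustrative model of predicated SIMD execution with an exec mask,
# mirroring the v_cmp_lt_u32 / s_and_saveexec_b64 / s_mov_b64 sequence.
WAVEFRONT = 16          # illustrative lane count (real SI wavefront: 64)

def run_if_kernel(v):
    ids = list(range(WAVEFRONT))            # v0: per-lane global id
    exec_mask = [True] * WAVEFRONT          # all lanes start active
    saved = exec_mask[:]                    # s_and_saveexec: save mask
    cmp_mask = [i < 5 for i in ids]         # v_cmp_lt_u32 id, 5
    exec_mask = [e and c for e, c in zip(exec_mask, cmp_mask)]
    for lane in range(WAVEFRONT):           # tbuffer_store under the mask
        if exec_mask[lane]:
            v[ids[lane]] = 10
    exec_mask = saved                       # s_mov_b64 exec: restore
    return v
```

This is why divergent branches cost cycles on a GPU: the inactive lanes still occupy their slots while the store executes under the mask.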

43 The Southern Islands Emulator Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 43

44 The Southern Islands Emulator Program Loading
The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction. Natively, it writes to hardware registers and global memory locations; on Multi2Sim, it calls initialization functions of the emulator.
Setup includes: the instruction memories in the compute units, each with one copy of the ISA section of the kernel binary; the initial global memory image, copying global buffers from CPU to GPU memory; the kernel arguments; and the ND-Range topology, including number of dimensions and sizes.

45 The Southern Islands Emulator Emulation Loop
Work-group execution: work-groups can execute in any order, and this order is irrelevant for emulation purposes. The chosen policy is to execute one work-group at a time, in increasing order of ID for each dimension.
Wavefront execution: wavefronts within a work-group can also execute in any order, as long as synchronizations are honored. The chosen policy is to execute one wavefront at a time until it hits a barrier, if any.
The loop: split the ND-Range into work-groups; while work-groups remain, grab one and split it into wavefronts; for each running wavefront not stalled in a barrier: read, emulate (update memory and registers), and advance the PC.
Demo 6
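The scheduling policy above can be sketched by listing the order in which (work-group, wavefront) pairs execute; barriers are set aside, and the helper names are ours:

```python
# Sketch of the emulator's scheduling policy: work-groups run one at a
# time in increasing ID order; within a work-group, one wavefront runs
# at a time (barriers aside).
WAVEFRONT_SIZE = 64

def emulate_ndrange(num_work_groups, work_group_size):
    """Return the order in which (work_group, wavefront) pairs execute."""
    wavefronts = -(-work_group_size // WAVEFRONT_SIZE)  # ceiling division
    order = []
    for wg in range(num_work_groups):          # increasing ID order
        for wf in range(wavefronts):           # one wavefront at a time
            order.append((wg, wf))
    return order
```

Any other order would be equally valid functionally, which is precisely what lets the timing simulator choose a different, hardware-driven schedule.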

46 Southern Islands Timing Simulation Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 46

47 Southern Islands Timing Simulation The GPU Architecture
A command processor receives and processes commands from the host. When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available. The following slides present the architecture of each compute unit.
Topology: Command Processor → Ultra-Threaded Dispatcher → Compute Units 0..31, each with an L1 cache, connected through a crossbar to the main memory hierarchy (L2 caches, memory controllers, video memory).

48 Southern Islands Timing Simulation The Compute Unit
The instruction memory of each compute unit contains the OpenCL kernel. A front-end fetches instructions and sends them to the appropriate execution unit. There is one scalar unit, one vector-memory unit, one branch unit, and one LDS (local data store) unit, and there are multiple instances of the SIMD unit.

49 Southern Islands Timing Simulation The Front-End
Work-groups are split into wavefronts and allocated to wavefront pools. Fetch and issue stages operate in a round-robin fashion over the pools, with one fetch buffer per wavefront pool. Issue distributes instructions into per-unit issue buffers: each SIMD unit's issue buffer matches one wavefront pool, while the scalar, branch, vector-memory, and LDS units each have a single issue buffer. There is one SIMD unit associated with each wavefront pool.

50 Southern Islands Timing Simulation The SIMD Unit Runs arithmetic-logic vector instructions. There are 4 SIMD units, each one associated with one of the 4 wavefront pools. The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete. In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles. The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively. 50
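The work-item-to-lane mapping this implies (matching the next slide, where lane 0 runs work-items 0, 16, 32, 48 and lane 15 runs 15, 31, 47, 63) can be written down directly; the helper names are ours:

```python
# Mapping of a 64-work-item wavefront onto 16 SIMD lanes in 4
# subwavefronts, as described above.
LANES = 16
SUBWAVEFRONTS = 4       # 4 x 16 = 64 work-items per wavefront

def lane_of(work_item):
    return work_item % LANES

def subwavefront_of(work_item):
    return work_item // LANES    # also the relative execute cycle

def items_on_lane(lane):
    """Work-items a given stream core executes, one per cycle."""
    return [sub * LANES + lane for sub in range(SUBWAVEFRONTS)]
```

Pipelining the 4 subwavefronts over 4 consecutive cycles is what lets 16 physical stream cores present the throughput of a 64-wide SIMD unit.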

51 Southern Islands Timing Simulation The SIMD Unit
Pipeline: instructions flow from the compute unit front-end into the issue buffer, then through decode, read (vector register file), execute, write (vector register file), and complete, with a buffer between consecutive stages. Each of the 16 SIMD lanes contains pipelined functional units: lane 0 executes work-items 0, 16, 32, and 48; lane 1 executes work-items 1, 17, 33, and 49; ...; lane 15 executes work-items 15, 31, 47, and 63.

52 Southern Islands Timing Simulation The Scalar Unit
Runs both arithmetic-logic and memory scalar instructions. Modeled with 5 stages: decode, read, execute/memory, write, complete. Instructions arrive from the compute unit front-end through the issue buffer; the scalar register file is accessed in the read and write stages.

53 Southern Islands Timing Simulation The Vector-Memory Unit
Runs vector memory instructions. Modeled with 5 stages: decode, read, memory, write, complete. Accesses to the global memory hierarchy happen mainly in this unit; the vector register file provides addresses in the read stage and receives loaded data in the write stage.

54 Southern Islands Timing Simulation The Branch Unit
Runs branch instructions, which decide whether an entire wavefront jumps to a target address depending on the scalar condition code. Modeled with 5 stages: decode, read, execute, write, complete. The scalar register file provides the condition codes and holds the program counter.

55 Southern Islands Timing Simulation The LDS (Local Data Store) Unit
Runs local memory access instructions. Modeled with 5 stages: decode, read, execute/memory, write, complete. The memory stage accesses the compute unit's local memory for reads and writes.

56 Southern Islands Timing Simulation Benchmark Support
AMD OpenCL SDK: host programs are statically linked x86 binaries. The SDK is available for device kernels using three different architectures: Evergreen, Southern Islands, and x86. Packages include data files and execution commands.
Other applications: support for applications from the Rodinia and Parboil benchmark suites has been added by other users. Packages are under development.

57 Global Memory Hierarchy Organization
Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU: one 16KB data L1 per compute unit; one scalar L1 cache shared by every 4 compute units; six L2 banks of 128KB each, each connected to a DRAM module.
Topology: CU 0..31, each with a vector L1; one scalar cache per group of 4 compute units; an interconnect down to L2 banks 0..5.

58 Global Memory Hierarchy Policies
Inclusive, write-back, non-coherent caches. Non-coherence is implemented as a 3-state protocol, NSI:
N (non-coherent): the block can be accessed for read/write, while shared with other caches. Blocks in N state are merged on write-back using write bit masks.
S (shared): the block is accessible for read access, possibly shared with other caches.
I (invalid): the block does not contain valid data.
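The write-back merge of N-state blocks can be sketched as follows: each L1 keeps a bit mask of the bytes it actually wrote, and only those bytes overwrite the L2 copy. Block size and names are illustrative:

```python
# Sketch of merging a non-coherent (N-state) block on write-back:
# only bytes flagged in the write bit mask overwrite the L2 copy.
def merge_writeback(l2_block, l1_block, write_mask):
    """Return the merged block; write_mask[i] is True if byte i was
    written by the L1 that owns this copy of the block."""
    assert len(l2_block) == len(l1_block) == len(write_mask)
    return [l1 if written else l2
            for l2, l1, written in zip(l2_block, l1_block, write_mask)]
```

As long as two L1 caches write disjoint bytes of the same block, their write-backs merge without loss, which is why the protocol can skip invalidations entirely.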

59 Southern Islands Pipeline Visualization Disassembler Emulator (or functional simulator) Timing simulator (or detailed/ architectural) Visual tool 59

60 Southern Islands Pipeline Visualization Main Window and Timing Diagram Main window provides cycle-by-cycle navigation throughout simulation. A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups. The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches. Demo 7 60

61 Case Study: ND-Range Virtualization
Motivation: this could also be called cooperative work-group execution, or next-level system heterogeneity.
The idea: the programmer writes one single OpenCL kernel. The OpenCL runtime creates one kernel binary targeting each architecture present in the system. The ND-Range is dynamically and automatically partitioned, and work-groups are spread across CPU and GPU cores.
The result: the data-parallel programming model complexity resides in the design of the OpenCL kernel, not in cross-device scheduling. System resources (cores) are better utilized and transparently managed.

62 Case Study: ND-Range Virtualization Current Heterogeneous Execution
Work-group mapping: the complete ND-Range is sent to one device, selected by the user with a call to clGetDeviceIDs. The host program runs on one x86 core, and the device kernel runs on the Southern Islands compute units.
Attained concurrency: the x86 cores sit in idle execution regions while the Southern Islands compute units run the ND-Range.

63 Case Study: ND-Range Virtualization Heterogeneous Execution with ND-Range Virtualization
Work-group mapping: portions of the ND-Range are executed by CPU/GPU cores with different ISAs. The x86 cores run both the host program and a portion of the ND-Range. Work-groups are mapped to CPU cores or GPU compute units as they become available.
Attained concurrency: idle regions are removed during the execution of the ND-Range.
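The dynamic partitioning can be sketched as a shared work-group queue that every device drains at its own speed. The device names and their relative speeds below are invented for illustration:

```python
# Sketch of dynamic ND-Range partitioning: devices pull work-groups
# from a shared queue as they become available. Device names and
# relative speeds are invented for illustration.
from collections import deque

def partition_ndrange(num_work_groups, device_speeds):
    """device_speeds: {name: work-groups consumed per scheduling round}.
    Returns {name: [work-group IDs executed by that device]}."""
    queue = deque(range(num_work_groups))
    assignment = {name: [] for name in device_speeds}
    while queue:
        for name, speed in device_speeds.items():
            for _ in range(speed):
                if not queue:
                    break
                assignment[name].append(queue.popleft())
    return assignment
```

Faster devices naturally end up with more work-groups, and no device idles while work-groups remain, which is the concurrency gain the slide describes.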

64 Case Study: ND-Range Virtualization The Hardware
Protocols MOESI and NSI are merged into a single 6-state protocol: NMOESI. CPU and GPU cores with any ISA (x86, Evergreen, Southern Islands, Fermi, ARM, ...) can be connected to different entry points of the memory hierarchy through the NMOESI interface. Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store. GPUs and CPUs running OpenCL kernels issue n-store write accesses; the rest issue regular store accesses.

65 Case Study: ND-Range Virtualization The Software
The OpenCL host program is simplified: only one command queue is needed to exploit system heterogeneity. The runtime provides an additional virtual fused device, returned as a result of a call to clGetDeviceIDs. The driver extends its interface to allow each physical device to run a portion of the ND-Range, at work-group granularity.

66 Concluding Remarks 66

67 The Multi2Sim Community Additional Material
The Multi2Sim Guide: complete documentation of the simulator's user interface, simulation models, and additional tools.
Multi2Sim forums and mailing list: new version releases and other important information are posted on the Multi2Sim mailing list (no spam, about one message per month). Users share questions and knowledge on the website forum.
M2S-Cluster: automatic verification framework for Multi2Sim, based on a cluster of computers running Condor.

68 The Multi2Sim Community Multi2C The Multi2Sim Compiler
LLVM-based compiler for OpenCL and CUDA kernels. The future release Multi2Sim 4.2 will include a working version. Diagrams show progress as per the SVN revision (orange = under development, yellow = arriving soon, green = supported).
Tool chain: vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm; vec-add.cu → CUDA to LLVM front-end → vec-add.llvm. From LLVM: Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin; Fermi back-end → vec-add.s → Fermi assembler → vec-add.cubin; Kepler back-end → vec-add.s → Kepler assembler → vec-add.cubin.

69 The Multi2Sim Community Academic Efforts at Northeastern
The GPU Programming and Architecture course: we started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs. Average attendance of ~25 students per semester.
Undergraduate directed studies: an official alternative, equivalent to a 4-credit course, that an undergraduate student can optionally enroll in, collaborating in Multi2Sim development.
Graduate-level development: many research projects at the graduate level are based on Multi2Sim and are selectively included in the development trunk for public access. Examples include simulation of OpenGL pipelines and support for new CPU/GPU architectures, among others.

70 The Multi2Sim Community Collaborating Research Groups Universidad Politécnica de Valencia Pedro López, Salvador Petit, Julio Sahuquillo, José Duato. Northeastern University Chris Barton, Shu Chen, Zhongliang Chen, Tahir Diop, Xiang Gong, David Kaeli, Nicholas Materise, Perhaad Mistry, Dana Schaa, Rafael Ubal, Mark Wilkening, Ang Shen, Tushar Swamy, Amir Ziabari. University of Mississippi Byunghyun Jang NVIDIA Norm Rubin University of Toronto Jason Anderson, Natalie Enright, Steven Gurfinkel, Tahir Diop. University of Texas Rustam Miftakhutdinov 70

71 The Multi2Sim Community Multi2Sim Academic Publications
Conference papers:
- Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD.
- The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT.
Tutorials:
- The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT.
- Programming and Simulating Fused Devices: OpenCL and Multi2Sim, ICPE.
- Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL.
- Simulation of OpenCL and APUs on Multi2Sim, ISCA.

72 The Multi2Sim Community Published Academic Works Using Multi2Sim
Recent:
- R. Miftakhutdinov, E. Ebrahimi, Y. Patt, Predicting Performance Impact of DVFS for Realistic Memory Systems, MICRO.
- D. Lustig, M. Martonosi, Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization, HPCA.
Other:
- H. Calborean, R. Jahr, T. Ungerer, L. Vintan, A Comparison of Multi-objective Algorithms for the Automatic Design Space Exploration of a Superscalar System, Advances in Intelligent Systems and Computing.
- X. Li, C. Wang, X. Zhou, Z. Zhu, Cache Promotion Policy Using Re-reference Interval Prediction, CLUSTER.
... and 62 more citations, as per Google Scholar.

73 The Multi2Sim Community Sponsors 73


More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Implementing an efficient method of check-pointing on CPU-GPU

Implementing an efficient method of check-pointing on CPU-GPU Implementing an efficient method of check-pointing on CPU-GPU Harsha Sutaone, Sharath Prasad and Sumanth Suraneni Abstract In this paper, we describe the design, implementation, verification and analysis

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Cache Memory Access Patterns in the GPU Architecture

Cache Memory Access Patterns in the GPU Architecture Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 7-2018 Cache Memory Access Patterns in the GPU Architecture Yash Nimkar ypn4262@rit.edu Follow this and additional

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Regression Modelling of Power Consumption for Heterogeneous Processors. Tahir Diop

Regression Modelling of Power Consumption for Heterogeneous Processors. Tahir Diop Regression Modelling of Power Consumption for Heterogeneous Processors by Tahir Diop A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

Register File Organization

Register File Organization Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities

More information

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors COSC 6385 Computer Architecture - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors Fall 2012 References Intel Larrabee: [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M.

More information

Crusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor

Crusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor Crusoe Reference Thinking Outside the Box The Transmeta Crusoe Processor 55:132/22C:160 High Performance Computer Architecture The Technology Behind Crusoe Processors--Low-power -Compatible Processors

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Practical Malware Analysis

Practical Malware Analysis Practical Malware Analysis Ch 4: A Crash Course in x86 Disassembly Revised 1-16-7 Basic Techniques Basic static analysis Looks at malware from the outside Basic dynamic analysis Only shows you how the

More information

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

Compilers and Code Optimization EDOARDO FUSELLA

Compilers and Code Optimization EDOARDO FUSELLA Compilers and Code Optimization EDOARDO FUSELLA Contents LLVM The nu+ architecture and toolchain LLVM 3 What is LLVM? LLVM is a compiler infrastructure designed as a set of reusable libraries with well

More information

" # " $ % & ' ( ) * + $ " % '* + * ' "

 #  $ % & ' ( ) * + $  % '* + * ' ! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs A Dissertation Presented by Yash Ukidave to The Department of Electrical and Computer Engineering in partial

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Introduction. L25: Modern Compiler Design

Introduction. L25: Modern Compiler Design Introduction L25: Modern Compiler Design Course Aims Understand the performance characteristics of modern processors Be familiar with strategies for optimising dynamic dispatch for languages like JavaScript

More information

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam Assembly Language Lecture 2 - x86 Processor Architecture Ahmed Sallam Introduction to the course Outcomes of Lecture 1 Always check the course website Don t forget the deadline rule!! Motivations for studying

More information

CS152 Computer Architecture and Engineering. Complex Pipelines

CS152 Computer Architecture and Engineering. Complex Pipelines CS152 Computer Architecture and Engineering Complex Pipelines Assigned March 6 Problem Set #3 Due March 20 http://inst.eecs.berkeley.edu/~cs152/sp12 The problem sets are intended to help you learn the

More information

Assembly Language. Lecture 2 x86 Processor Architecture

Assembly Language. Lecture 2 x86 Processor Architecture Assembly Language Lecture 2 x86 Processor Architecture Ahmed Sallam Slides based on original lecture slides by Dr. Mahmoud Elgayyar Introduction to the course Outcomes of Lecture 1 Always check the course

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined

More information

GPUs have enormous power that is enormously difficult to use

GPUs have enormous power that is enormously difficult to use 524 GPUs GPUs have enormous power that is enormously difficult to use Nvidia GP100-5.3TFlops of double precision This is equivalent to the fastest super computer in the world in 2001; put a single rack

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

Pentium IV-XEON. Computer architectures M

Pentium IV-XEON. Computer architectures M Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction

More information

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

Lecture 26: Parallel Processing. Spring 2018 Jason Tang Lecture 26: Parallel Processing Spring 2018 Jason Tang 1 Topics Static multiple issue pipelines Dynamic multiple issue pipelines Hardware multithreading 2 Taxonomy of Parallel Architectures Flynn categories:

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

The Distribution of OpenCL Kernel Execution Across Multiple Devices. Steven Gurfinkel

The Distribution of OpenCL Kernel Execution Across Multiple Devices. Steven Gurfinkel The Distribution of OpenCL Kernel Execution Across Multiple Devices by Steven Gurfinkel A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware
