Compilers and Code Optimization EDOARDO FUSELLA

1 Compilers and Code Optimization EDOARDO FUSELLA

2 Contents LLVM The nu+ architecture and toolchain

3 LLVM

4 What is LLVM? LLVM is a compiler infrastructure designed as a set of reusable libraries with well-defined interfaces. Implemented in C++. Several front ends and several back ends. First release: 2003. Open source.

5 LLVM is a Compilation Infrastructure. It is a framework that comes with many tools to compile and optimize code.

6 LLVM vs GCC clang/clang++ are very competitive with gcc: each compiler is faster on some benchmarks and slower on others. clang/clang++ usually have faster compilation times.

7 Why Learn LLVM? Intensively used in academia. Used by many companies: LLVM is maintained by Apple, and companies such as ARM, NVIDIA, Mozilla, and Cray use it. Clean and modular interfaces. Open source. LLVM implements the entire compilation flow: front end (e.g., clang & clang++), middle end (analyses and optimizations), back end (code generation for different computer architectures).

8 LLVM compilation flow Like gcc, clang supports different optimization levels, e.g., -O0 (default), -O1, -O2, and -O3.

9 LLVM Intermediate Representation (Example taken from the slides of Gennady Pekhimenko, "The LLVM Compiler Framework and Infrastructure".) LLVM represents programs internally via its own instruction set. The LLVM optimizations manipulate these bytecodes. We can program directly on them, and we can also interpret them.

10 LLVM Bytecodes are Interpretable Bytecode is a form of instruction set designed for efficient execution by a software interpreter. They are portable! Example: Java bytecodes. The tool lli directly executes programs in LLVM bitcode format. lli may compile these bytecodes just in time, if a JIT is available.

11 What Does the LLVM IR Look Like? RISC-like instruction set, with the usual opcodes: add, mul, or, shift, branch, load, store, etc. Typed representation. Static Single Assignment (SSA) form: compared to three-address code, all assignments in SSA are to variables with distinct names; hence the term static single assignment.
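As a small hedged illustration (not from the slides), the following C fragment mimics at the source level the renaming that SSA construction performs: the second function assigns every name exactly once.

    // Ordinary code: x is assigned more than once.
    int f(int a, int b) {
        int x = a + b;
        x = x * 2;          // x is redefined
        return x + 1;
    }

    // The same computation with every name defined exactly once,
    // which is the form the SSA-based LLVM IR uses internally.
    int f_ssa(int a, int b) {
        int x1 = a + b;     // first definition of x becomes x1
        int x2 = x1 * 2;    // the redefinition becomes a new name, x2
        return x2 + 1;
    }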

12 Generating Machine Code Once we have optimized the intermediate program, we can translate it to machine code. In LLVM, we use the llc tool to perform this translation. This tool is able to target many different architectures

13 Nu+

14 The Nu+ processor: current state Hardware: ~18,000 lines of SystemVerilog code; two versions (multi-core and single-core); hardware multi-threading; scalar and vector (SIMD) operations; dynamic instruction scheduling (simple scoreboard). Consolidated ISA: 32- and 64-bit operations; masked operations, also used for control flow; rollback stage (involves branches and loops). High-performance cache hierarchy with DDR3 support: private L1 cache for each core; shared distributed L2 cache with a directory-based coherence protocol; non-coherent scratchpad memory; handling of the variable latencies of the SPM and writeback stages. Resource utilization (1 core / 8 threads, 16 HW lanes, 64 registers per thread, caches: 512 bit / 4 ways / 128 sets): LUTs, FFs, 102 BRAMs, 146 DSPs, respectively 8%, 5%, 8%, and 6% of the Virtex-7 resources. Integrated with MANGO. Software/compiler toolchain: LLVM-based; builtins exposed to the C/C++ programmer; integrated with MANGOLIB; polyhedral analysis (requires external tools).

15 Nu+ hardware project SoC-like organization: core + memory + IO devices. Proprietary bus with a MANGO-like interface; our bus-to-AXI bridge connects AXI-compliant devices. Completely written from scratch in SystemVerilog.

16 Nu+ current microarchitecture Configurable parameters: number of cores; number of threads; number of HW lanes; number of registers per thread; cache set size, number of ways, and number of 32-bit words in each line; SPM parameters: number/size of banks and type of partitioning.

17 Nu+ Configurability Highly parameterizable: threads per core; L1 and L2 cache configuration options; hardware SIMD lanes per thread; number of registers; scratchpad memory parameters; IO memory-map address space.

18 Nu+ Scratch-Pad Memory Default parameters: SPM size 8 kB; data bus width 512 bits (16 lanes accessing 32-bit operands), giving 16 concurrent accesses in the absence of conflicts; ultra-low latency, 3 clock cycles in the absence of conflicts. A. Cilardo, M. Gagliardi, C. Donnarumma, "A Configurable Shared Scratchpad Memory for GPU-like Processors", Proc. of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, Springer, pp. 3-14.

19 Nu+ Register file Large register file: 58 general-purpose 32-bit scalar registers S0-S57, configurable into 64-bit registers; 6 special registers: trap register TR, mask register RM, frame pointer FP, stack pointer SP, return address RA; 64 general-purpose 512-bit vector registers V0-V63.

20 Nu+ Instruction formats 1/2 Instructions are encoded in eight 32-bit formats: R, arithmetic operations with register/register encoding; I, arithmetic operations with immediate encoding (two registers and a 9-bit immediate value); MOVEI, move instructions with a 16-bit immediate value; C, control operations (such as cache control); J, jump operations; M, memory operations (main memory and scratchpad memory).

21 Nu+ Instruction formats 2/2 Bits 31-24 (the most significant 8 bits) encode the format and opcode. Bit M is used for masked instructions. The FMT bits specify whether a certain operand is a scalar or a vector (one bit for every register in the format). Bit L is high for "long" operations, i.e. operations that require long integers or double-precision numbers. Bit S is high when a load/store operation accesses the scratchpad memory.

22 Nu+ toolchain

23 Features Some interesting features handled by the compiler: native support for 32-bit and 64-bit operations, either floating-point (IEEE-754 compliant) or integer; native support for complex arithmetic and vector (SIMD) instructions; vector instructions can be masked in order to operate on a subset of the vector elements; native support for the scratchpad memory through specific load/store instructions.

24 Arithmetic The nu+ execution pipeline simultaneously supports: 16 single-precision floating-point (IEEE-754 compliant) or 32-bit integer operations; 8 double-precision floating-point (IEEE-754 compliant) or 64-bit integer operations. Each vector lane has its own 32-bit operator. For an operation on 64-bit values, the L bit (instruction format) must be set to one and adjacent lane pairs are merged into a single 64-bit-wide operator.

25 Vector arithmetic The nu+ architecture includes a separate vector register file with 64 512-bit vector registers V0-V63. Each register is configurable to store vectors of 16 32-bit elements or 8 64-bit elements. In addition, it is also possible to store vectors of 16-bit and 8-bit elements. The 16x32 and 8x64 layouts are natively supported by the hardware, while the others require a conversion/extension/truncation. Use native types when targeting performance, non-native types when targeting memory footprint.

26 stdint.h We redefined the stdint.h header file to provide a set of typedefs that specify vector types. OpenCL-compliant: vector types are created using the ext_vector_type attribute.
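As a hedged illustration of what the redefined header might contain (only the vec16i32 name is confirmed by the later slides; the other names are assumptions by analogy), vector typedefs built with Clang's ext_vector_type attribute look like this:

    /* Sketch of possible vector typedefs; vec8i64 and vec16i8 are assumed names. */
    typedef int       vec16i32 __attribute__((ext_vector_type(16)));  /* 16 x 32-bit */
    typedef long long vec8i64  __attribute__((ext_vector_type(8)));   /*  8 x 64-bit */
    typedef char      vec16i8  __attribute__((ext_vector_type(16)));  /* 16 x  8-bit */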

27 libc: custom implementation Custom versions of the following standard C libraries: ctype.h, math.h, stdlib.h (except the dynamic memory management functions calloc, free, malloc, realloc and the environment functions abort, atexit, at_quick_exit, exit, getenv, quick_exit, system), and string.h.

28 It is infeasible to show the backend code in this presentation; a few examples will show some interesting aspects of the nu+ architecture/toolchain. Note that the following code is generated without any optimization (-O0). Nu+ is an open-source project and the whole compiler will soon be available at

29 NuPlusRegisterInfo.td Declarations

30 NuPlusRegisterInfo.td Registers

31 NuPlusRegisterInfo.td Register Classes

32 NuPlusInstrFormats.td class FR

33 NuPlusInstrInfo.td Arithmetic Integer Two operands Defined in NuPlusInstrFormats.td

34 32-bit scalar constants We never use the constant pool for scalar constants. We rely on two instructions of the MOVEI format: moveil, which moves the lower 16 bits, and moveih, which moves the higher 16 bits.
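A minimal hedged sketch, at the C level, of the split described above; the 32-bit value is illustrative:

    /* Illustrative only: how a 32-bit constant decomposes into the two
       16-bit halves loaded by moveil and moveih. */
    int main (){
        unsigned int value = 0x30401234u;              /* hypothetical constant    */
        unsigned int low   = value & 0xFFFFu;          /* 0x1234, loaded by moveil */
        unsigned int high  = (value >> 16) & 0xFFFFu;  /* 0x3040, loaded by moveih */
        return (high << 16 | low) == value ? 0 : 1;    /* reassembles the constant */
    }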

35 64-bit scalar constants 64-bit constants are split into two 32-bit constants that are loaded with two moveil/moveih pairs into two 32-bit registers. Then two 32-bit move instructions move the contents of these two 32-bit registers into the lower and higher parts of a 64-bit register.

36 Natively supported vector arithmetic: v16i32 + v16i32. Vectors are placed in the same section as the function, so they can be accessed with PC-relative addresses.

37 Non-natively supported vector arithmetic: v16i8 + v16i8. Sign-extend instructions are emitted to support the promotion of each element in the vector. Load/store instructions still work on the original vector types, even after the arithmetic operation, saving memory space.
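A hedged source-level example of such a non-native sum; the vec16i8 typedef is an assumed name by analogy with vec16i32 and is not shown in the slides:

    #include <stdint.h>
    int main (){
        vec16i8 a;          /* 16 x 8-bit elements (assumed typedef) */
        vec16i8 b;
        vec16i8 c = a + b;  /* elements are sign-extended for the add, while loads
                               and stores keep using the 8-bit vector type */
        return 0;
    }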

38 Vector arithmetic with different types: v8i8 + v8i64. Intrinsics are required to explicitly promote vector types. After the promotion, the information related to the original vector type is lost.

39 Scratchpad memory We rely on GNU GCC attributes [1]. scratchpad is defined as: #define scratchpad __attribute__((scratchpad)). __attribute__((scratchpad)) is made up of: __attribute__((section("scratchpad"))), used to create a new section in the ELF, and __attribute__((address_space(77))), used to define a new address space.

40 programming Nu+: exploit parallelism Three levels of exploitable parallelism: vector lanes (SIMD), which require custom vector types; hardware multithreading and multi-core, which require nu+ builtins.

41 programming Nu+: vector support Operators between vector types: arithmetic operators (+, -, *, /, %); relational operators (==, !=, <, <=, >, >=); bitwise operators (&, |, ^, ~, <<, >>); logical operators (&&, ||, !); assignment operators (=, +=, -=, *=, /=, %=, <<=, >>=, &=, ^=, |=).

42 vector support: from C to OpenCL #include <stdint.h> int a[16] __attribute__((aligned(64))) = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}; int main (){ vec16i32* va = reinterpret_cast<vec16i32*>(&a); } Conversion between int[16] and vec16i32. Nu+ vector types are 64-byte aligned, and the conversion is possible only if both types have the same alignment. reinterpret_cast is a compiler directive which instructs the compiler to treat the sequence of bits as if it had a different type.

43 vector support: vector and vector/scalar sums #include <stdint.h> int main (){ vec16i32 a; vec16i32 b; vec16i32 c = a+b; } #include <stdint.h> int main (){ vec16i32 a; int b; vec16i32 c = a+b; } The first example defines two vectors of 16 integer elements and computes the vector sum a+b; the second defines one vector of 16 integer elements and a scalar and computes the sum between vector a and scalar b.

44 vector support: vector initialization Vectors can be initialized using curly-bracket syntax. A constant vector: #include <stdint.h> int main (){ const vec16i32 a = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }; } A non-constant vector: #include <stdint.h> int main (){ int x, y, z;... vec16i32 a = { x, y, z, x, y, z, x, y, z, x, y, z, x, y, z, x}; }

45 vector support: operator [] #include <stdint.h> int main (){ vec16i32 a; /* assign some values */ for (int i=0; i<16; i++) a[i]=i; int sum = 0; /* calculate sum */ for (int i=0; i<16; i++) sum += a[i]; } The operator [] can be used to access individual vector elements.

46 vector support: comparisons Two possibilities: relational operators, or specific builtins (optimized for nu+). #include <stdint.h> int main (){ vec16i32 a; vec16i32 b; int c = __builtin_nuplus_mask_cmpi32_slt(a, b); } The integer c will contain a bitmap where each bit is 0 or 1 according to the result of the comparison on the corresponding elements.
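A hedged usage sketch; the mapping of bit i in the bitmap to vector lane i is an assumption:

    #include <stdint.h>
    int main (){
        vec16i32 a;
        vec16i32 b;
        int c = __builtin_nuplus_mask_cmpi32_slt(a, b);
        int lane3_less = (c >> 3) & 1;   /* 1 if a[3] < b[3], under the assumed bit/lane mapping */
        return lane3_less;
    }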

47 vector support: handling SIMD control flow #include <stdint.h> int main (){ vec16i32 a; vec16i32 b; int c = __builtin_nuplus_mask_cmpi32_slt(a, b); int rm_old = __builtin_nuplus_read_mask_reg(); __builtin_nuplus_write_mask_reg(c); do_something(); c = c^-1; __builtin_nuplus_write_mask_reg(c); do_somethingelse(); __builtin_nuplus_write_mask_reg(rm_old); } At the beginning all lanes are enabled; SIMD control flow is obtained through masking operations. Steps: 1. generate the mask for a<b; 2. save the mask register; 3. write the mask register for a<b; 4. generate the mask for a>=b; 5. write the mask register for a>=b; 6. restore the old mask.

48 programming Nu+: multithreading support Explicitly handled by the programmer using builtins: __builtin_nuplus_read_control_reg(2) returns, for each hardware thread, its thread id; __builtin_nuplus_barrier(int ID, int number_of_threads) performs thread synchronization using the hardware barrier.

49 programming Nu+: multithreading support TLP: independent stacks; private register files; shared caches and SPM. Same entry point, but different flows: programmers can specialize or span tasks over different threads using their IDs. Barrier synchronization support.
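A hedged sketch of task specialization with the builtins from the previous slide; the thread count, array size, and barrier ID are illustrative assumptions:

    #include <stdint.h>

    #define NUM_THREADS 8                /* assumed hardware thread count */
    #define N 1024

    int data[N];

    int main (){
        int tid   = __builtin_nuplus_read_control_reg(2);   /* this thread's id */
        int chunk = N / NUM_THREADS;
        for (int i = tid * chunk; i < (tid + 1) * chunk; i++)
            data[i] *= 2;                                    /* each thread owns one chunk */
        __builtin_nuplus_barrier(0, NUM_THREADS);            /* wait for all threads */
        return 0;
    }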

50 programming Nu+: coherence mechanism Policies: __builtin_nuplus_write_control_reg(16,1) sets write-through; __builtin_nuplus_write_control_reg(16,0) sets write-back, which is high-performance but requires an explicit flush of data to main memory through __builtin_nuplus_flush(int data_address); #include <stdint.h> int main (){ vec16i32 a; vec16i32 b; vec16i32 c = a+b; __builtin_nuplus_flush((int)(&c)); }

51 programming Nu+: Scratchpad memory Variables declared with the scratchpad attribute are placed in the scratchpad and accessed using the appropriate load/store instructions. Note that only global variables can be placed in the scratchpad memory.
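A hedged example of a scratchpad-resident global, assuming the scratchpad macro from slide 39; the buffer name and size are illustrative:

    #include <stdint.h>

    #define scratchpad __attribute__((scratchpad))   /* as defined on slide 39 */

    scratchpad int lut[256];                         /* only globals may live in the SPM */

    int main (){
        int sum = 0;
        for (int i = 0; i < 256; i++)
            sum += lut[i];                           /* lowered to scratchpad load instructions */
        return sum;
    }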

52 programming Nu+: Custom operations Custom hardware: customizing nu+ with a specific functional unit (SFU): add the HDL code to the hardware project; the ISA is provided with specific instructions to use the SFU; builtins are exposed to the programmer to exploit the SFU. Some builtins: int __builtin_nuplus_f1_int(int a, int b); float __builtin_nuplus_f1_float(float a, float b);
