Exercise Session 6. Data Processing on Modern Hardware L Fall Semester Cagri Balkesen


1 Exercise Session 6
Data Processing on Modern Hardware, Exercises, Fall Semester 2012
Cagri Balkesen, cagri.balkesen@inf.ethz.ch
Department of Computer Science, ETH Zurich, Switzerland
25 October 2012

2 Flynn's Taxonomy
Computer architecture classification according to Flynn [Fly72]:
  SISD: Single Instruction, Single Data Stream
  SIMD: Single Instruction, Multiple Data Streams
  MISD: Multiple Instruction, Single Data Stream
  MIMD: Multiple Instruction, Multiple Data Streams
[Figure: SIMD. Instructions enter a single control unit, which drives data processing units #1 to #4; each unit produces its own result.]

3 Early SIMD Machines
CDC STAR-100
  Released 1974
  Vector supercomputer supporting memory-to-memory vector operations
Cray X-MP/28 (CAB)
  Introduction: 1982
  Word length: 64 bit
  Memory: 8M words
  Register-to-register vector operations
  8 vector registers with up to 64 words each
[Figure: ETH Cray X-MP/28 (CAB)]

4 Variable Vector Length on a Cray X-MP
Example: vector integer addition of the first 53 elements of two vector registers, using vector registers V1, V2, V3:
  V3[0:52] ← V1[0:52] + V2[0:52]
  Line  Instr.        Description
  1     A1 ← 53       Set address register A1 to 53
  2     VL ← A1       Set vector length to 53
  3     V3 ← V1+V2    Perform addition
Latency of the add: VL + 8 = 61 cycles (1 cycle = 9.5 ns)
[Figure 5.1: The vector section of the Cray X-MP, showing the vector registers, the vector control (vector length and vector mask), the vector functional units, and the floating point functional units. The Vector Pop/Parity unit shares its input path with the Reciprocal Approximation unit. The Second Vector Logical unit shares both its input and output paths with the Floating Point Multiply unit. Source: [RR89]]
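The three-instruction sequence above can be mimicked in plain C to make the role of the vector length register concrete. This is only an illustrative sketch: `vector_add`, `add_latency_ns`, and `CRAY_MAX_VL` are hypothetical names, and the latency function simply restates the slide's formula (VL + 8 cycles at 9.5 ns per cycle), not a model of the real machine.

```c
#include <assert.h>
#include <stdint.h>

#define CRAY_MAX_VL 64   /* a vector register holds up to 64 words (slide 3) */

/* Emulates V3[0:vl-1] <- V1[0:vl-1] + V2[0:vl-1] for vector length vl.
 * Elements at index vl and above are left untouched, as with VL on the Cray. */
static void vector_add(int64_t *v3, const int64_t *v1, const int64_t *v2, int vl)
{
    assert(vl >= 0 && vl <= CRAY_MAX_VL);
    for (int i = 0; i < vl; i++)
        v3[i] = v1[i] + v2[i];
}

/* Latency formula from the slide: (VL + 8) cycles, 9.5 ns per cycle. */
static double add_latency_ns(int vl)
{
    return (vl + 8) * 9.5;
}
```

With `vl = 53`, the formula gives 61 cycles, or 579.5 ns.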

5 Vector Units in Modern CPUs
PowerPC
  AltiVec (Motorola/Freescale), VMX (IBM), Velocity Engine (Apple)
  PowerPC G4, G5 and Cell BE
  128-bit vector registers
  Dedicated vector unit
UltraSPARC
  VIS: Visual Instruction Set
  Uses 64-bit FPU registers
AMD
  3DNow! (since AMD K6-2)
  Integer + single precision floating point
Intel
  MMX
    Since Pentium MMX
    8 64-bit registers (alias to FPU stack)
    Before Pentium: no vector unit
  SSE to SSE4
    Introduced in Pentium III
    Dedicated vector unit, combined with MMX
  AVX
    256-bit registers
    3-operand, non-destructive instructions

6 Intel SSE
SSE
  Since Pentium III
  128-bit vector registers xmm0, ..., xmm7
  Single precision FP
  Dedicated vector unit, but shares resources with the FPU
SSE2
  Since Pentium IV
  Double precision FP
  Extends the MMX integer instructions to 128 bit
  Full integer support on XMM registers (without using MMX registers)
SSE3
  Since Pentium IV (Prescott)
  New horizontal operations
SSSE3
  Since Core 2 (Merom)
  New permutation instructions
SSE4
  Since Core 2 (Penryn)
  Dot product
x86-64
  Adds additional vector registers xmm8, ..., xmm15
AVX
  256 bits
  ymm8, ..., ymm15

7 SSE2 Vector Registers
Integer data types (128-bit vector register, 16 bytes)
  16 byte elements:        b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
  8 word elements:         w7 w6 w5 w4 w3 w2 w1 w0
  4 double word elements:  dw3 dw2 dw1 dw0
  2 quad word elements:    qw1 qw0
Floating point data types
  4 float elements:        f3 f2 f1 f0
  2 double elements:       d1 d0
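The lane layouts above all describe the same 16 bytes. A rough way to see this without any intrinsics is a C union. The sketch below is an illustration only: `vec128_t` and `make_byte_ramp` are hypothetical names, and the element numbering matches the slide only on a little-endian machine such as x86.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit value viewed through the lane layouts listed on the slide. */
typedef union {
    uint8_t  b[16];   /* 16 byte elements       */
    uint16_t w[8];    /* 8 word elements        */
    uint32_t dw[4];   /* 4 double word elements */
    uint64_t qw[2];   /* 2 quad word elements   */
    float    f[4];    /* 4 float elements       */
    double   d[2];    /* 2 double elements      */
} vec128_t;

/* Fills the register image so that byte element i holds the value i. */
static vec128_t make_byte_ramp(void)
{
    vec128_t v;
    assert(sizeof v == 16);
    for (int i = 0; i < 16; i++)
        v.b[i] = (uint8_t)i;
    return v;
}
```

On a little-endian machine, reading `w[0]` of the ramp yields `0x0100`, because word element 0 is built from byte elements 1 and 0.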

8 Assignment: Warm-up Exercise
Download the skeleton C code from the course website.
compareandcount.c does the following:
  An array is filled with 100 million numbers in [0, 99]
  It counts how many values are > 42
Implement a SIMD-accelerated version using SSE intrinsics.
Measure the speed-up.
Caveats:
  The result of a SIMD comparison is all 1s if true and 0 if false (not 1 and 0).
  In order to count, either shift or exploit that all 1s = -1 in two's complement.
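One possible shape for the counting kernel is sketched below. It is not the course's reference solution: the function names are hypothetical, and the skeleton file's actual interface is not shown in the slides. The sketch uses the "all 1s = -1" trick by subtracting the comparison mask from an accumulator. It relies on `_mm_set1_epi32`, `_mm_setzero_si128`, and `_mm_sub_epi32`, which are SSE2 intrinsics that do not appear in the slide's list, and it falls back to the scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Scalar reference: counts elements strictly greater than threshold. */
static size_t count_gt_scalar(const int32_t *data, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (data[i] > threshold);
    return count;
}

/* SIMD version: four comparisons per iteration, scalar loop for the tail.
 * Each 32-bit lane counter must stay below 2^31, which holds for the
 * 100 million element array of the exercise (at most 25 million per lane). */
static size_t count_gt_simd(const int32_t *data, size_t n, int32_t threshold)
{
    size_t i = 0, count = 0;
    assert(data != NULL || n == 0);
#if defined(__SSE2__)
    const __m128i thresh = _mm_set1_epi32(threshold);
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i gt = _mm_cmpgt_epi32(v, thresh); /* -1 where v > thresh, else 0 */
        acc = _mm_sub_epi32(acc, gt);            /* subtracting -1 adds 1 */
    }
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    count = (size_t)lanes[0] + (size_t)lanes[1]
          + (size_t)lanes[2] + (size_t)lanes[3];
#endif
    for (; i < n; i++)                           /* remaining elements */
        count += (data[i] > threshold);
    return count;
}
```

Comparing `count_gt_simd(data, n, 42)` against `count_gt_scalar(data, n, 42)` is a cheap correctness check before measuring the speed-up.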

9 Assignment: Column Value (De)compression
In file compression.c the following compression schemes are implemented:
  32-to-8 bit
  32-to-9 bit
  32-to-7 bit
Use C macros to switch between versions.
Serial decompression is implemented.
Execution time is measured and decompressed values are validated.
Implement the following functions using SSE intrinsics:
  SIMD_decompress8to32(...)
  SIMD_decompress9to32(...)
  SIMD_decompress7to32(...)

10 Decompression Step 1: Copy Values
Step 1: Bring data into proper 32-bit words:
  packed input:  v13 v12 v11 v10 v9 v8 v7 v6 v5 v4 v3 v2 v1 v0
  shuffle mask:  FF FF 4 3 | FF FF 3 2 | FF FF 2 1 | FF FF 1 0
  output words:  v3 | v2 | v1 | v0
Use shuffle instructions to move bytes within SIMD registers:
  __m128i out = _mm_shuffle_epi8(in, shufmask);
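A minimal sketch of this step follows, assuming the mask on the slide is listed from the highest byte down to byte 0 and that it belongs to the 9-bit case, where value k straddles input bytes k and k+1. `spread_bytes` and `SHUF_MASK` are hypothetical names. `_mm_shuffle_epi8` is an SSSE3 instruction, so the intrinsic path is only compiled when SSSE3 is enabled (for example with `-mssse3`); otherwise the sketch uses a portable loop that mirrors the shuffle semantics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */
#endif

/* Shuffle mask from the slide, written from byte 0 upward: output word k
 * receives input bytes k and k+1; 0xFF (high bit set) zeroes the byte. */
static const uint8_t SHUF_MASK[16] = {
    0, 1, 0xFF, 0xFF,
    1, 2, 0xFF, 0xFF,
    2, 3, 0xFF, 0xFF,
    3, 4, 0xFF, 0xFF
};

/* Spreads the first five packed bytes of in[] over four 32-bit words. */
static void spread_bytes(const uint8_t in[16], uint32_t out[4])
{
    assert(in && out);
#if defined(__SSSE3__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i m = _mm_loadu_si128((const __m128i *)SHUF_MASK);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, m));
#else
    /* Portable emulation: same byte selection, one byte at a time. */
    uint8_t bytes[16];
    for (int i = 0; i < 16; i++)
        bytes[i] = (SHUF_MASK[i] & 0x80) ? 0 : in[SHUF_MASK[i] & 0x0F];
    memcpy(out, bytes, sizeof bytes);
#endif
}
```

Both paths assume a little-endian layout, so that byte k of the register is the low byte of a word.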

11 Decompression Step 2: Establish Same Bit Alignment
Step 2: Make all four words identically bit-aligned:
  word:           v3      v2      v1      v0
  offset before:  3 bits  2 bits  1 bit   0 bits
  shift by:       0 bits  1 bit   2 bits  3 bits
  offset after:   3 bits  3 bits  3 bits  3 bits
SIMD shift instructions do not support per-element (variable) shift amounts!
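Since one shift instruction applies the same amount to every element, a common workaround is to multiply each element by a power of two instead. The slides do not state which workaround the course intends; the sketch below assumes the multiplication approach, which is plausible given that `_mm_mullo_epi32` appears in the slide's intrinsics list. `align_lanes` is a hypothetical name. `_mm_mullo_epi32` is an SSE4.1 instruction, so the intrinsic path is only compiled when SSE4.1 is enabled (for example with `-msse4.1`).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1: _mm_mullo_epi32 */
#endif

/* Word k starts at bit offset k. Multiplying it by 2^(3-k) is the same as
 * shifting it left by 3-k bits, so afterwards every word has offset 3. */
static void align_lanes(const uint32_t in[4], uint32_t out[4])
{
    static const uint32_t factors[4] = { 8, 4, 2, 1 };
    assert(in && out);
#if defined(__SSE4_1__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i f = _mm_loadu_si128((const __m128i *)factors);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(v, f));
#else
    /* Portable fallback with the same per-word arithmetic. */
    for (int k = 0; k < 4; k++)
        out[k] = in[k] * factors[k];
#endif
}
```

The multiplication cannot overflow here as long as each word holds at most two meaningful bytes, as it does after step 1.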

12 Decompression Step 3: Shift and Mask
Step 3: Word-align data and mask out invalid bits:
  __m128i shifted = _mm_srli_epi32(in, 3);
  __m128i result = _mm_and_si128(shifted, maskval);
[Figure: the four words v3 v2 v1 v0 before the shift, after the shift, and after masking]
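The two intrinsics above can be wrapped into a small helper, sketched here for the 9-bit case. The mask value `0x1FF` (nine 1 bits) and the name `shift_and_mask` are assumptions for illustration; the slides do not show how `maskval` is defined. The intrinsic path needs only SSE2, with a scalar fallback for other targets.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_srli_epi32, _mm_and_si128 */
#endif

/* Moves each value from bit offset 3 down to bit 0 and keeps the low 9 bits. */
static void shift_and_mask(const uint32_t in[4], uint32_t out[4])
{
    assert(in && out);
#if defined(__SSE2__)
    const __m128i maskval = _mm_set1_epi32(0x1FF);   /* assumed 9-bit mask */
    __m128i v       = _mm_loadu_si128((const __m128i *)in);
    __m128i shifted = _mm_srli_epi32(v, 3);          /* same shift in all words */
    _mm_storeu_si128((__m128i *)out, _mm_and_si128(shifted, maskval));
#else
    for (int k = 0; k < 4; k++)
        out[k] = (in[k] >> 3) & 0x1FFu;
#endif
}
```

Bits below offset 3 and above the ninth value bit belong to neighbouring values, which is why both the shift and the mask are needed.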

13 Useful collection of SSE Intrinsics
  Intrinsic function              Description
  _mm_loadu_si128(src)            Load data from memory into register
  _mm_storeu_si128(dest, reg)     Store data back in memory
  _mm_set_epi32(v0,v1,v2,v3)      Load four 32-bit integers into register
  _mm_set_epi8(v0,...,v15)        Load sixteen 8-bit integers into register
  _mm_cmpgt_epi32(reg1, reg2)     Greater-than compare of the four 32-bit values in the registers
  _mm_add_epi32(reg1, reg2)       Addition of the four 32-bit values in the registers
  _mm_and_si128(reg, mask)        Bitwise and of two registers (masking)
  _mm_mullo_epi32(reg1, reg2)     Multiplication of the four 32-bit values in the registers
  _mm_extract_epi32(reg, pos)     Extract 32-bit integer at position pos from register
  _mm_shuffle_epi32(reg, mask)    Shuffle 32-bit integers according to the shuffle mask
  _mm_srli_si128(reg, bytecnt)    Shift entire register right (bytewise)
  _mm_slli_si128(reg, bytecnt)    Shift entire register left (bytewise)
  _mm_srli_epi32(reg, bitcnt)     Shift 32-bit integers in register to the right (bitwise)
  _mm_slli_epi32(reg, bitcnt)     Shift 32-bit integers in register to the left (bitwise)
Refer to the intrinsics reference for details.

14 Auto-Vectorization in gcc
Recent versions of gcc can auto-vectorize C code.
Use command line options:
  -ftree-vectorize               turn on auto-vectorization (default for -O3)
  -ftree-vectorizer-verbose=x    set reporting verbosity level of the vectorizer
  -msse                          generate SSE code
  -msse2                         generate SSE2 code
  -msse3                         generate SSE3 code
gcc does not vectorize if
  the code contains branches,
  unconstrained pointers are used (aliasing),
  the loops are uncountable.

15 Auto-Vectorization of Volcano Column-Store
Implementation of the arithmetic operator in engine.c:
    int next_arithop(int32_t *tuples, arithop_t *op) {
        ...
        switch (op->operation) {
        case ADD:
            for (n = 0; n < minnum; n++)
                tuples[n] = op->left_input[n] + op->right_input[n];
            break;
        case SUB:
            for (n = 0; n < minnum; n++)
                tuples[n] = op->left_input[n] - op->right_input[n];
            break;
        ...
Compilation:
    $ gcc -m64 -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c engine.c
    ...
    engine.c:104: note: not vectorized: unhandled data-ref
    engine.c:109: note: not vectorized: unhandled data-ref
    ...
    engine.c:95: note: vectorized 0 loops in function.
Auto-vectorization failed. Cause: the inputs are accessed through the op pointer, possible aliasing? Remedy: restrict pointers.

16 Vectorized Volcano Column-Store using SSE
Using restricted pointers in loops:
    int next_arithop(int32_t *tuples, arithop_t *op) {
        ...
        int32_t* restrict l = op->left_input;
        int32_t* restrict r = op->right_input;
        ...
        switch (op->operation) {
        case ADD:
            for (n = 0; n < minnum; n++)
                tuples[n] = l[n] + r[n];
            break;
        case SUB:
            for (n = 0; n < minnum; n++)
                tuples[n] = l[n] - r[n];
            break;
        ...
Compilation:
    $ gcc -m64 -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c engine.c
    ...
    engine.c:113: note: Alignment of access forced using peeling.
    engine.c:113: note: LOOP VECTORIZED
    engine.c:121: note: Alignment of access forced using peeling.
    engine.c:121: note: LOOP VECTORIZED
    ...
    engine.c:102: note: vectorized 5 loops in function.
Successful vectorization.
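The slide shows only a fragment of engine.c. A self-contained loop in the same spirit, which can be compiled on its own to watch the vectorizer's report, might look like the sketch below. `add_columns` is a hypothetical function, not part of the course code. Newer gcc releases report vectorization through `-fopt-info-vec` rather than `-ftree-vectorizer-verbose`, so the exact flag depends on the compiler version in use.

```c
#include <assert.h>
#include <stdint.h>

/* restrict promises the compiler that dst, l and r never overlap, which
 * removes the aliasing concern that blocked vectorization on slide 15.
 * The loop is countable and branch-free, matching the other two conditions. */
static void add_columns(int32_t *restrict dst, const int32_t *restrict l,
                        const int32_t *restrict r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = l[i] + r[i];
}
```

Passing overlapping arrays to such a function breaks the promise made by `restrict` and results in undefined behavior, so the qualifier should only be added where the caller can guarantee disjoint buffers.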

17 Speedup: SSE Vectorization of Volcano Column-Store
Query: SELECT sum(orderkey+linenumber*shipdate) FROM lineitems
Data set: 6 million rows
CPU: Core 2 Quad Q… GHz
  non-vectorized:          … ms
  gcc auto-vectorization:  … ms
Speedup = 1.10
The load is memory-bound, not CPU-bound.

18 Hand in your Results
Hand in your results to …
Subject = DPMH:assignment4 {your netzname}
Body = Description of the CPU you tested on, e.g., Xeon QuadCore L5520 (4x 2267 MHz)
Attach plots + raw data
Attach source code (optional)

19 References
[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21(9), September 1972.
[RR89] Kay A. Robbins and Steven Robbins. The Cray X-MP/Model 24, chapter 5. Springer LNCS, 1989.
[WPB+09] Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. SIMD-Scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1), 2009.
[ZR02] Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In SIGMOD '02, Madison, Wisconsin, USA, 2002.

Dan Stafford, Justine Bonnot

Dan Stafford, Justine Bonnot Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing

More information

Exercise Session 5. Data Processing on Modern Hardware L Fall Semester Cagri Balkesen

Exercise Session 5. Data Processing on Modern Hardware L Fall Semester Cagri Balkesen Cagri Balkesen Data Processing on Modern Hardware Exercises Fall 2012 1 Exercise Session 5 Data Processing on Modern Hardware 263-3502-00L Fall Semester 2012 Cagri Balkesen cagri.balkesen@inf.ethz.ch Department

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization 1 Introduction The purpose of this lab assignment is to give some experience in using SIMD instructions on x86 and getting compiler

More information

OpenCL Vectorising Features. Andreas Beckmann

OpenCL Vectorising Features. Andreas Beckmann Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

SWAR: MMX, SSE, SSE 2 Multiplatform Programming

SWAR: MMX, SSE, SSE 2 Multiplatform Programming SWAR: MMX, SSE, SSE 2 Multiplatform Programming Relatore: dott. Matteo Roffilli roffilli@csr.unibo.it 1 What s SWAR? SWAR = SIMD Within A Register SIMD = Single Instruction Multiple Data MMX,SSE,SSE2,Power3DNow

More information

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016 SIMD Instructions outside and inside Oracle 2c Laurent Léturgez 206 Whoami Oracle Consultant since 200 Former developer (C, Java, perl, PL/SQL) Owner@Premiseo: Data Management on Premise and in the Cloud

More information

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions

More information

Data-Parallel Execution using SIMD Instructions

Data-Parallel Execution using SIMD Instructions Data-Parallel Execution using SIMD Instructions 1 / 26 Single Instruction Multiple Data data parallelism exposed by the instruction set CPU register holds multiple fixed-size values (e.g., 4 times 32-bit)

More information

Masterpraktikum Scientific Computing

Masterpraktikum Scientific Computing Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Logins Levels of Parallelism Single Processor Systems Von-Neumann-Principle

More information

Parallel Processing SIMD, Vector and GPU s

Parallel Processing SIMD, Vector and GPU s Parallel Processing SIMD, Vector and GPU s EECS4201 Fall 2016 York University 1 Introduction Vector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating on Single Data

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

Parallel Processing SIMD, Vector and GPU s

Parallel Processing SIMD, Vector and GPU s Parallel Processing SIMD, ector and GPU s EECS4201 Comp. Architecture Fall 2017 York University 1 Introduction ector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating

More information

Most of the slides in this lecture are either from or adapted from slides provided by the authors of the textbook Computer Systems: A Programmer s

Most of the slides in this lecture are either from or adapted from slides provided by the authors of the textbook Computer Systems: A Programmer s Most of the slides in this lecture are either from or adapted from slides provided by the authors of the textbook Computer Systems: A Programmer s Perspective, 2 nd Edition and are provided from the website

More information

Computer System Architecture

Computer System Architecture CSC 203 1.5 Computer System Architecture Budditha Hettige Department of Statistics and Computer Science University of Sri Jayewardenepura Microprocessors 2011 Budditha Hettige 2 Processor Instructions

More information

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 SE205 - TD1 Énoncé General Instructions You can download all source files from: https://se205.wp.mines-telecom.fr/td1/ SIMD-like Data-Level Parallelism Modern processors often come with instruction set

More information

Jignesh M. Patel. Blog:

Jignesh M. Patel. Blog: Jignesh M. Patel Blog: http://bigfastdata.blogspot.com Go back to the design Query Cache from Processing for Conscious 98s Modern (at Algorithms Hardware least for Hash Joins) 995 24 2 Processor Processor

More information

Computer System Architecture

Computer System Architecture CSC 203 1.5 Computer System Architecture Department of Statistics and Computer Science University of Sri Jayewardenepura Instruction Set Architecture (ISA) Level 2 Introduction 3 Instruction Set Architecture

More information

SIMD Programming CS 240A, 2017

SIMD Programming CS 240A, 2017 SIMD Programming CS 240A, 2017 1 Flynn* Taxonomy, 1966 In 2013, SIMD and MIMD most common parallelism in architectures usually both in same system! Most common parallel processing programming style: Single

More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

Alex Bennée stsquad on #qemu Virtualization Linaro Projects: QEMU, KVM, ARM 2. 1

Alex Bennée stsquad on #qemu Virtualization Linaro Projects: QEMU, KVM, ARM 2. 1 VECTORS MEET VIRTUALIZATION ALEX BENNÉE FOSDEM 2018 1 INTRODUCTION Alex Bennée alex.bennee@linaro.org stsquad on #qemu Virtualization Developer @ Linaro Projects: QEMU, KVM, ARM 2. 1 WHAT IS QEMU? From:

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

Parallelized Progressive Network Coding with Hardware Acceleration

Parallelized Progressive Network Coding with Hardware Acceleration Parallelized Progressive Network Coding with Hardware Acceleration Hassan Shojania, Baochun Li Department of Electrical and Computer Engineering University of Toronto Network coding Information is coded

More information

Intel 64 and IA-32 Architectures Software Developer s Manual

Intel 64 and IA-32 Architectures Software Developer s Manual Intel 64 and IA-32 Architectures Software Developer s Manual Volume 1: Basic Architecture NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of five volumes: Basic Architecture,

More information

Chap. 9 Pipeline and Vector Processing

Chap. 9 Pipeline and Vector Processing 9-1 Parallel Processing = Simultaneous data processing tasks for the purpose of increasing the computational speed Perform concurrent data processing to achieve faster execution time Multiple Functional

More information

VECTORIZING RECOMPRESSION IN COLUMN-BASED IN-MEMORY DATABASE SYSTEMS

VECTORIZING RECOMPRESSION IN COLUMN-BASED IN-MEMORY DATABASE SYSTEMS Dep. of Computer Science Institute for System Architecture, Database Technology Group Master Thesis VECTORIZING RECOMPRESSION IN COLUMN-BASED IN-MEMORY DATABASE SYSTEMS Cheng Chen Matr.-Nr.: 3924687 Supervised

More information

Instruction Set extensions to X86. Floating Point SIMD instructions

Instruction Set extensions to X86. Floating Point SIMD instructions Instruction Set extensions to X86 Some extensions to x86 instruction set intended to accelerate 3D graphics AMD 3D-Now! Instructions simply accelerate floating point arithmetic. Accelerate object transformations

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information

Intel 64 and IA-32 Architectures Software Developer s Manual

Intel 64 and IA-32 Architectures Software Developer s Manual Intel 64 and IA-32 Architectures Software Developer s Manual Volume 1: Basic Architecture NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of seven volumes: Basic Architecture,

More information

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer

More information

A study on SIMD architecture

A study on SIMD architecture A study on SIMD architecture Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian Department of Electrical Engineering and Computer Science University of Central Florida Email: {gsolmaz,rrahmati,mohammad}@knights.ucf.edu

More information

Fundamentals of Computer Design

Fundamentals of Computer Design CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Efficient Decoding of Posting Lists with SIMD Instructions

Efficient Decoding of Posting Lists with SIMD Instructions Journal of Computational Information Systems 11: 24 (2015) 7747 7755 Available at http://www.jofcis.com Efficient Decoding of Posting Lists with SIMD Instructions Naiyong AO 1, Xiaoguang LIU 2, Gang WANG

More information

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Improving the Addweighted Function in OpenCV 3.0 Using SSE and AVX Intrinsics

Improving the Addweighted Function in OpenCV 3.0 Using SSE and AVX Intrinsics Improving the Addweighted Function in OpenCV 3.0 Using SSE and AVX Intrinsics Panyayot Chaikan and Somsak Mitatha Abstract This paper presents a new algorithm for improving the speed of OpenCV s addeighted

More information

last time out-of-order execution and instruction queues the data flow model idea

last time out-of-order execution and instruction queues the data flow model idea 1 last time 2 out-of-order execution and instruction queues the data flow model idea graph of operations linked by depedencies latency bound need to finish longest dependency chain multiple accumulators

More information

Lecture 8. Vector Processing. 8.1 Flynn s taxonomy. M. J. Flynn proposed a categorization of parellel computer systems in 1966.

Lecture 8. Vector Processing. 8.1 Flynn s taxonomy. M. J. Flynn proposed a categorization of parellel computer systems in 1966. Lecture 8 Vector Processing 8.1 Flynn s taxonomy M. J. Flynn proposed a categorization of parellel computer systems in 1966. Single Instruction, Single Data stream (SISD) Single Instruction, Multiple Data

More information

Instruction Set Principles and Examples. Appendix B

Instruction Set Principles and Examples. Appendix B Instruction Set Principles and Examples Appendix B Outline What is Instruction Set Architecture? Classifying ISA Elements of ISA Programming Registers Type and Size of Operands Addressing Modes Types of

More information

SIMD. Utilization of a SIMD unit in the OS Kernel. Shogo Saito 1 and Shuichi Oikawa 2 2. SIMD. SIMD (Single SIMD SIMD SIMD SIMD

SIMD. Utilization of a SIMD unit in the OS Kernel. Shogo Saito 1 and Shuichi Oikawa 2 2. SIMD. SIMD (Single SIMD SIMD SIMD SIMD OS SIMD 1 2 SIMD (Single Instruction Multiple Data) SIMD OS (Operating System) SIMD SIMD OS Utilization of a SIMD unit in the OS Kernel Shogo Saito 1 and Shuichi Oikawa 2 Nowadays, it is very common that

More information

SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input

More information

Technical Report. Research Lab: LERIA

Technical Report. Research Lab: LERIA Technical Report Improvement of Fitch function for Maximum Parsimony in Phylogenetic Reconstruction with Intel AVX2 assembler instructions Research Lab: LERIA TR20130624-1 Version 1.0 24 June 2013 JEAN-MICHEL

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture. Idea:

More information

Lecture 16 SSE vectorprocessing SIMD MultimediaExtensions

Lecture 16 SSE vectorprocessing SIMD MultimediaExtensions Lecture 16 SSE vectorprocessing SIMD MultimediaExtensions Improving performance with SSE We ve seen how we can apply multithreading to speed up the cardiac simulator But there is another kind of parallelism

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point

More information

Computer Architecture and Organization

Computer Architecture and Organization 10-1 Chapter 10 - Advanced Computer Architecture Computer Architecture and Organization Miles Murdocca and Vincent Heuring Chapter 10 Advanced Computer Architecture 10-2 Chapter 10 - Advanced Computer

More information

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH Key words: Digital Signal Processing, FIR filters, SIMD processors, AltiVec. Grzegorz KRASZEWSKI Białystok Technical University Department of Electrical Engineering Wiejska

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

Chapter 06: Instruction Pipelining and Parallel Processing. Lesson 14: Example of the Pipelined CISC and RISC Processors

Chapter 06: Instruction Pipelining and Parallel Processing. Lesson 14: Example of the Pipelined CISC and RISC Processors Chapter 06: Instruction Pipelining and Parallel Processing Lesson 14: Example of the Pipelined CISC and RISC Processors 1 Objective To understand pipelines and parallel pipelines in CISC and RISC Processors

More information

Chapter 1. Computer Abstractions and Technology. Lesson 3: Understanding Performance

Chapter 1. Computer Abstractions and Technology. Lesson 3: Understanding Performance Chapter 1 Computer Abstractions and Technology Lesson 3: Understanding Performance Manufacturing ICs 1.7 Real Stuff: The AMD Opteron X4 Yield: proportion of working dies per wafer Chapter 1 Computer Abstractions

More information

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI CSCI 402: Computer Architectures Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI op Instruction address 6 bits 26 bits Jump Addressing J-type

More information

AMD Opteron TM & PGI: Enabling the Worlds Fastest LS-DYNA Performance

AMD Opteron TM & PGI: Enabling the Worlds Fastest LS-DYNA Performance 3. LS-DY Anwenderforum, Bamberg 2004 CAE / IT II AMD Opteron TM & PGI: Enabling the Worlds Fastest LS-DY Performance Tim Wilkens Ph.D. Member of Technical Staff tim.wilkens@amd.com October 4, 2004 Computation

More information

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy
