Exercise Session 6. Data Processing on Modern Hardware, 263-3502-00L, Fall Semester 2012. Cagri Balkesen
1 Exercise Session 6
Data Processing on Modern Hardware, 263-3502-00L, Fall Semester 2012
Cagri Balkesen, cagri.balkesen@inf.ethz.ch
Department of Computer Science, ETH Zurich, Switzerland
25 October 2012
2 Flynn's Taxonomy
Computer architecture classification according to Flynn [Fly72]:
- SISD: Single Instruction, Single Data Stream
- SIMD: Single Instruction, Multiple Data Streams
- MISD: Multiple Instruction, Single Data Stream
- MIMD: Multiple Instruction, Multiple Data Streams
[Figure: SIMD. A control unit issues instructions to data processing units #1 to #4, each producing a result.]
3 Early SIMD Machines
CDC STAR-100
- Released 1974
- Vector supercomputer supporting memory-to-memory vector operations
Cray X-MP/28 (CAB)
- Introduction 1982
- Word length: 64 bit
- Memory: 8M words
- Register-to-register vector operations
- 8 vector registers with up to 64 words each
[Figure: ETH Cray X-MP/28 (CAB)]
4 Variable Vector Length on a Cray X-MP
Example: Vector integer addition of the first 53 elements of two vector registers (V1, V2, V3: vector registers):
V3[0:52] ← V1[0:52] + V2[0:52]
Line, instruction, description:
- 1: A1 ← 53 (set addr. reg. A1 to 53)
- 2: VL ← A1 (set vector length to 53)
- 3: V3 ← V1+V2 (perform addition)
Latency of add: VL+8 = 61 cycles (1 cycle = 9.5 ns)
[Figure 5.1: The vector section of the Cray X-MP, from [RR89]. It shows the vector registers, the vector control with vector length and vector mask, the vector functional units (add, shift, logical, second logical, pop/parity) and the floating point functional units (add, multiply, reciprocal approximation). The Vector Pop/Parity unit shares its input path with the Reciprocal Approximation unit. The Second Vector Logical unit shares both its input and output paths with the Floating Point Multiply unit.]
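As a quick worked check of the latency figure, assuming the 9.5 ns cycle time quoted on the slide applies to every cycle of the add:

```latex
\[
  t_{\text{add}} = (\mathrm{VL} + 8)\ \text{cycles}
                 = (53 + 8) \times 9.5\,\text{ns}
                 = 61 \times 9.5\,\text{ns}
                 = 579.5\,\text{ns}
\]
```

Spread over the 53 elements, this is roughly 10.9 ns per element, so the fixed 8-cycle start-up cost matters less the longer the vector is.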
5 Vector Units in Modern CPUs
PowerPC
- AltiVec (Motorola/Freescale), VMX (IBM), Velocity Engine (Apple)
- PowerPC G4, G5 and Cell BE
- 128-bit vector registers
- Dedicated vector unit
UltraSPARC
- VIS: Visual Instruction Set
- Uses 64-bit FPU registers
AMD
- 3DNow! (since AMD K6-2)
- Integer + single precision floating point
Intel
- MMX: since Pentium MMX; 8 64-bit registers (alias to FPU stack); before Pentium no vector unit
- SSE to SSE4: introduced in Pentium III; dedicated vector unit, combined with MMX
- AVX: 256-bit registers; 3-operand, non-destructive instructions
6 Intel SSE
SSE
- Since Pentium III
- 128-bit vector registers xmm0, ..., xmm7
- Single precision FP
- Dedicated vector unit but shared resources with FPU
SSE2
- Since Pentium IV
- Double precision FP
- Extends MMX registers to 128 bit
- Full integer support on XMM registers (without using MMX registers)
SSE3
- Since Pentium IV (Prescott)
- New horizontal operations
SSSE3
- Since Core 2 (Merom)
- New permutation instructions
SSE4
- Since Core 2 (Penryn)
- Dot product
x86-64
- Adds additional vector registers xmm8, ..., xmm15
AVX
- 256 bits
- ymm8, ..., ymm15
7 SSE2 Vector Registers
128 bit vector register (16 bytes)
Integer Data Types
- 16 byte elements: b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
- 8 word elements: w7 w6 w5 w4 w3 w2 w1 w0
- 4 double word elements: dw3 dw2 dw1 dw0
- 2 quad word elements: qw1 qw0
Floating Point Data Types
- 4 float elements: f3 f2 f1 f0
- 2 double elements: d1 d0
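The element layouts above can be pictured as different views of the same 16 bytes. The sketch below is illustration only: `xmm_view` is a hypothetical helper type, not something from the slides or from the intrinsics headers, and it assumes the usual x86 sizes (4-byte float, 8-byte double).

```c
#include <stdint.h>

/* Hypothetical helper type: six views of the same 128 bits. */
typedef union {
    uint8_t  b[16];  /* 16 byte elements        */
    uint16_t w[8];   /*  8 word elements        */
    uint32_t dw[4];  /*  4 double word elements */
    uint64_t qw[2];  /*  2 quad word elements   */
    float    f[4];   /*  4 float elements       */
    double   d[2];   /*  2 double elements      */
} xmm_view;

/* Every view must cover exactly one 128-bit register. */
_Static_assert(sizeof(xmm_view) == 16, "an XMM register holds 128 bits");
```

On little-endian x86, element 0 of each view sits at the lowest address, so writing `dw[0] = 0x04030201` makes `b[0]` read 1 and `b[3]` read 4.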
8 Assignment: Warm-up Exercise
- Download skeleton C code from course website
- compareandcount.c does the following:
  - Array is filled with 100 million numbers in [0, 99]
  - Counts how many values are > 42
- Implement a SIMD-accelerated version using SSE intrinsics
- Measure the speed-up
Caveats:
- The result of a SIMD comparison is all 1s in an element if true and 0 if false (not 1 and 0)
- In order to count, either shift or exploit that all 1s (0xFFFFFFFF) = -1 as a signed integer
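A minimal sketch of the counting idea, using only SSE2 intrinsics. The function name `count_greater_sse2` and its signature are assumptions for illustration; the skeleton in compareandcount.c may be organized differently. The sketch exploits that a true comparison yields -1 per element, so subtracting the mask adds 1.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Count the elements of data[0..n) that are greater than threshold.
 * Assumes fewer than 2^32 matches per lane, which holds for 100 million inputs. */
size_t count_greater_sse2(const int32_t *data, size_t n, int32_t threshold)
{
    const __m128i limit = _mm_set1_epi32(threshold);
    __m128i acc = _mm_setzero_si128();  /* four 32-bit lane counters */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i mask = _mm_cmpgt_epi32(v, limit);  /* -1 where v > limit, else 0 */
        acc = _mm_sub_epi32(acc, mask);            /* acc - (-1) == acc + 1 */
    }

    /* Horizontal sum of the four lane counters. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    size_t total = (size_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        total += (size_t)(data[i] > threshold);
    return total;
}
```

Calling `count_greater_sse2(array, 100000000, 42)` would then replace the scalar counting loop. The alternative mentioned in the caveats, shifting the mask right by 31 bits and adding, gives the same count.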
9 Assignment: Column Value (De)compression
- In file compression.c the following compression is implemented: 32-to-8 bit, 32-to-9 bit, 32-to-7 bit
- Use C macros to switch between versions
- Serial decompression is implemented
- Execution time is measured and decompressed values validated
- Implement the following functions using SSE intrinsics:
  - SIMD_decompress8to32(...)
  - SIMD_decompress9to32(...)
  - SIMD_decompress7to32(...)
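To make the packed layout concrete, here is a scalar sketch of bit packing and unpacking. It is not the code from compression.c: the names `pack_bits` and `unpack_bits` are hypothetical, and the layout (values stored back to back, least significant bit first) is an assumption that happens to match the byte positions used in the decompression steps of the following slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack n values of `bits` bits each into out, least significant bit first.
 * out must be zero-initialized and hold at least (n * bits + 7) / 8 bytes. */
void pack_bits(const uint32_t *in, size_t n, unsigned bits, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        size_t start = i * bits;  /* bit position of value i in the stream */
        for (unsigned b = 0; b < bits; b++) {
            if ((in[i] >> b) & 1u)
                out[(start + b) / 8] |= (uint8_t)(1u << ((start + b) % 8));
        }
    }
}

/* Inverse of pack_bits: expand n packed values into 32-bit integers. */
void unpack_bits(const uint8_t *in, size_t n, unsigned bits, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        size_t start = i * bits;
        uint32_t v = 0;
        for (unsigned b = 0; b < bits; b++) {
            uint32_t bit = (in[(start + b) / 8] >> ((start + b) % 8)) & 1u;
            v |= bit << b;
        }
        out[i] = v;
    }
}
```

With 9-bit values, value 0 occupies bytes 0 and 1, value 1 bytes 1 and 2, and so on. A bit-at-a-time loop like this is slow, but it is handy as a reference when validating a SIMD version.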
10 Decompression Step 1: Copy Values
Step 1: Bring data into proper 32-bit words:
- Input register (packed values): v13 v12 v11 v10 v9 v8 v7 v6 v5 v4 v3 v2 v1 v0
- Shuffle mask (byte 15 down to byte 0): FF FF 4 3 | FF FF 3 2 | FF FF 2 1 | FF FF 1 0
- Output register (one value per 32-bit word): v3 v2 v1 v0
Use shuffle instructions to move bytes within SIMD registers:
    __m128i out = _mm_shuffle_epi8(in, shufmask);
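The shuffle step can be sketched as follows for the 9-bit case, which is the width the mask on the slide fits. The function name `spread_9bit_bytes` is hypothetical. `_mm_shuffle_epi8` is an SSSE3 instruction; the `target` attribute is a GCC/Clang feature used here so the snippet builds without `-mssse3`, which is an assumption about the toolchain.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Copy the byte pairs (0,1), (1,2), (2,3), (3,4) of the packed input
 * into the low halves of four 32-bit words; the high halves become zero. */
__attribute__((target("ssse3")))
void spread_9bit_bytes(const uint8_t in[16], uint32_t out[4])
{
    /* _mm_set_epi8 lists byte 15 first and byte 0 last.
     * A mask byte with its top bit set (-1 == 0xFF) produces a zero byte. */
    const __m128i shufmask = _mm_set_epi8(-1, -1, 4, 3,
                                          -1, -1, 3, 2,
                                          -1, -1, 2, 1,
                                          -1, -1, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i s = _mm_shuffle_epi8(v, shufmask);
    _mm_storeu_si128((__m128i *)out, s);
}
```

After this step, word i holds the two bytes that contain value i, but each value still starts at a different bit offset inside its word.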
11 Decompression Step 2: Establish Same Bit Alignment
Step 2: Make all four words identically bit-aligned:
- Before: v3, v2, v1, v0 start at bit offsets of 3 bits, 2 bits, 1 bit, 0 bits
- Shift v3 by 0 bits, v2 by 1 bit, v1 by 2 bits, v0 by 3 bits
- After: v3, v2, v1, v0 all start at a bit offset of 3 bits
SSE shift instructions apply the same shift amount to every element; they do not support variable (per-element) shift amounts!
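One common workaround for the missing per-element shift is to multiply each word by a power of two, since multiplying by 2^k is a left shift by k bits. The sketch below takes that route with `_mm_mullo_epi32` (SSE4.1). The slide does not say which technique the exercise expects, so treat this as one possible approach; the function name and the GCC/Clang `target` attribute are assumptions.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 */
#include <stdint.h>

/* Shift word 0 left by 3 bits, word 1 by 2, word 2 by 1 and word 3 by 0,
 * emulated as multiplications by 8, 4, 2 and 1. */
__attribute__((target("sse4.1")))
void align_words_by_multiply(const uint32_t in[4], uint32_t out[4])
{
    /* _mm_set_epi32 lists word 3 first and word 0 last. */
    const __m128i factors = _mm_set_epi32(1, 2, 4, 8);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_mullo_epi32(v, factors);  /* low 32 bits of each product */
    _mm_storeu_si128((__m128i *)out, r);
}
```

Each word holds at most 16 bits of payload after the shuffle step, so a left shift of up to 3 bits cannot overflow the 32-bit word.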
12 Decompression Step 3: Shift and Mask
Step 3: Word-align data and mask out invalid bits:
[Figure: the four words v3 v2 v1 v0 before the shift, after the shift, and after masking]
    __m128i shifted = _mm_srli_epi32(in, 3);
    __m128i result = _mm_and_si128(shifted, maskval);
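A self-contained version of the shift-and-mask step, assuming 9-bit values that all start at bit offset 3 within their word. The function name is hypothetical, and the mask value 0x1FF (nine 1 bits) is the assumed content of `maskval` for the 9-bit case. Both intrinsics are SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Move each 9-bit value down to bit 0 and clear the neighbouring bits. */
void shift_and_mask_9bit(const uint32_t in[4], uint32_t out[4])
{
    const __m128i maskval = _mm_set1_epi32(0x1FF);         /* nine 1 bits */
    __m128i v       = _mm_loadu_si128((const __m128i *)in);
    __m128i shifted = _mm_srli_epi32(v, 3);                /* same shift for all words */
    __m128i result  = _mm_and_si128(shifted, maskval);     /* drop bits of other values */
    _mm_storeu_si128((__m128i *)out, result);
}
```

For the 7-bit and 8-bit variants only the shift amount and the mask constant would change.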
13 Useful Collection of SSE Intrinsics
Intrinsic function: description
- _mm_loadu_si128(src): Load data from memory into register
- _mm_storeu_si128(dest, reg): Store data back in memory
- _mm_set_epi32(v0,v1,v2,v3): Load four 32-bit integers into register
- _mm_set_epi8(v0,...,v15): Load sixteen 8-bit integers into register
- _mm_cmpgt_epi32(reg1, reg2): Greater-than compare of the four 32-bit values in the registers
- _mm_add_epi32(reg1, reg2): Addition of the four 32-bit values in the registers
- _mm_and_si128(reg, mask): Bitwise AND of two registers (masking)
- _mm_mullo_epi32(reg1, reg2): Multiplication of the four 32-bit values in the registers (low 32 bits of each product)
- _mm_extract_epi32(reg, pos): Extract 32-bit integer at position pos from register
- _mm_shuffle_epi32(reg, mask): Shuffle 32-bit integers according to the shuffle mask
- _mm_srli_si128(reg, bytecnt): Shift entire register right (bytewise)
- _mm_slli_si128(reg, bytecnt): Shift entire register left (bytewise)
- _mm_srli_epi32(reg, bitcnt): Shift 32-bit integers in register to the right (bitwise)
- _mm_slli_epi32(reg, bitcnt): Shift 32-bit integers in register to the left (bitwise)
Refer to the intrinsics documentation for details.
14 Auto-Vectorization in gcc
Recent versions of gcc can auto-vectorize C code. Use command line options:
- -ftree-vectorize: turn on auto-vectorization (default for -O3)
- -ftree-vectorizer-verbose=x: set reporting verbosity level of vectorizer
- -msse: generate SSE code
- -msse2: generate SSE2 code
- -msse3: generate SSE3 code
gcc does not vectorize if
- the code contains branches,
- unconstrained pointers are used (aliasing),
- the loops are uncountable.
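The two loops below are a sketch of code written to avoid the obstacles listed above: the trip count is known on entry, the pointers are `restrict`-qualified, and the counting loop adds the comparison result instead of branching. The function names are hypothetical. Whether a particular gcc version actually vectorizes them is not guaranteed; the vectorizer report is the way to check.

```c
#include <stddef.h>
#include <stdint.h>

/* Countable loop over restrict-qualified arrays: dst may not overlap a or b. */
void add_columns(int32_t *restrict dst, const int32_t *restrict a,
                 const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Branch-free counting: the comparison evaluates to 0 or 1 and is added directly. */
size_t count_greater(const int32_t *data, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(data[i] > threshold);
    return count;
}
```

The `restrict` qualifier is a promise by the caller, so passing overlapping arrays to `add_columns` is undefined behaviour.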
15 Auto-Vectorization of Volcano Column-Store
Implementation of the arithmetic operator in engine.c:
    int next_arithop(int32_t *tuples, arithop_t *op) {
        ...
        switch (op->operation) {
        case ADD:
            for (n = 0; n < minnum; n++)
                tuples[n] = op->left_input[n] + op->right_input[n];
            break;
        case SUB:
            for (n = 0; n < minnum; n++)
                tuples[n] = op->left_input[n] - op->right_input[n];
            break;
        ...
Compilation:
    $ gcc -m64 -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c engine.c
    ...
    engine.c:104: note: not vectorized: unhandled data-ref
    engine.c:109: note: not vectorized: unhandled data-ref
    ...
    engine.c:95: note: vectorized 0 loops in function.
Auto-vectorization failed. Cause: the op pointer, possible aliasing? Remedy: restrict pointers.
16 Vectorized Volcano Column-Store using SSE
Using restricted pointers in loops:
    int next_arithop(int32_t *tuples, arithop_t *op) {
        ...
        int32_t *restrict l = op->left_input;
        int32_t *restrict r = op->right_input;
        ...
        switch (op->operation) {
        case ADD:
            for (n = 0; n < minnum; n++)
                tuples[n] = l[n] + r[n];
            break;
        case SUB:
            for (n = 0; n < minnum; n++)
                tuples[n] = l[n] - r[n];
            break;
        ...
Compilation:
    $ gcc -m64 -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c engine.c
    ...
    engine.c:113: note: Alignment of access forced using peeling.
    engine.c:113: note: LOOP VECTORIZED
    engine.c:121: note: Alignment of access forced using peeling.
    engine.c:121: note: LOOP VECTORIZED
    ...
    engine.c:102: note: vectorized 5 loops in function.
Result: successful vectorization.
17 Speedup SSE Vectorization of Volcano Column-Store
Query: SELECT sum(orderkey+linenumber*shipdate) FROM lineitems
- Data set: 6 million rows
- CPU: Core 2 Quad
- Speedup of gcc autovectorization over the non-vectorized version = 1.10
The workload is memory-bound, not CPU-bound.
18 Hand in your Results
Hand in your results by e-mail:
- Subject = DPMH:assignment4 {your netzname}
- Body = Description of the CPU you tested on, e.g., Xeon QuadCore L5520 (4x 2267 MHz)
- Attach plots + raw data
- Attach source code (optional)
19 References
[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21(9), September 1972.
[RR89] Kay A. Robbins and Steven Robbins. The Cray X-MP/Model 24, chapter 5. Springer LNCS, 1989.
[WPB+09] Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1), 2009.
[ZR02] Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In SIGMOD '02, Madison, Wisconsin, USA, 2002.
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationEJEMPLOS DE ARQUITECTURAS
Maestría en Electrónica Arquitectura de Computadoras Unidad 4 EJEMPLOS DE ARQUITECTURAS M. C. Felipe Santiago Espinosa Marzo/2017 ARM & MIPS Similarities ARM: the most popular embedded core Similar basic
More informationMemory access patterns. 5KK73 Cedric Nugteren
Memory access patterns 5KK73 Cedric Nugteren Today s topics 1. The importance of memory access patterns a. Vectorisation and access patterns b. Strided accesses on GPUs c. Data re-use on GPUs and FPGA
More informationCS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010
Parallel Programming Lecture 7: Introduction to SIMD Mary Hall September 14, 2010 Homework 2, Due Friday, Sept. 10, 11:59 PM To submit your homework: - Submit a PDF file - Use the handin program on the
More informationInstruction Level Parallelism
Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic
More informationCMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Compilers Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science
More informationHardware-Sensitive Database Operations
D B Hardware-Sensitive Database Operations Advanced Topics in Bala Gurumurthy Otto-von-Guericke University Magdeburg Summer 2018 Credits Parts of this lecture are based on content by Jens Teubner from
More informationVectorizing Database Column Scans with Complex Predicates
Vectorizing Database Column Scans with Complex Predicates Thomas Willhalm, Ismail Oukid, Ingo Müller, Franz Faerber thomas.willhalm@intel.com, i.oukid@sap.com, ingo.mueller@kit.edu, franz.faerber@sap.com
More informationCSC2/458 Parallel and Distributed Systems Machines and Models
CSC2/458 Parallel and Distributed Systems Machines and Models Sreepathi Pai January 23, 2018 URCS Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics Outline Recap Scalability Taxonomy
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationAn Introduction to Parallel Systems
An Introduction to Parallel Lecture 2 - Data Parallelism and Vector Processors University of Bath November 22, 2007 An Introduction to Parallel When Week 1 Introduction Who, What, Why, Where, When? Week
More informationHakam Zaidan Stephen Moore
Hakam Zaidan Stephen Moore Outline Vector Architectures Properties Applications History Westinghouse Solomon ILLIAC IV CDC STAR 100 Cray 1 Other Cray Vector Machines Vector Machines Today Introduction
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationAdvanced Computer Architecture Lab 4 SIMD
Advanced Computer Architecture Lab 4 SIMD Moncef Mechri 1 Introduction The purpose of this lab assignment is to give some experience in using SIMD instructions on x86. We will
More informationCS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions
CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Guest Lecturer: Alan Christopher 3/08/2014 Spring 2014 -- Lecture #19 1 Neuromorphic Chips Researchers at IBM and
More informationGPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com
GPU Acceleration of Matrix Algebra Dr. Ronald C. Young Multipath Corporation FMS Performance History Machine Year Flops DEC VAX 1978 97,000 FPS 164 1982 11,000,000 FPS 164-MAX 1985 341,000,000 DEC VAX
More information