SIMD. Utilization of a SIMD unit in the OS Kernel. Shogo Saito 1 and Shuichi Oikawa 2 2. SIMD. SIMD (Single SIMD SIMD SIMD SIMD

Size: px
Start display at page:

Download "SIMD. Utilization of a SIMD unit in the OS Kernel. Shogo Saito 1 and Shuichi Oikawa 2 2. SIMD. SIMD (Single SIMD SIMD SIMD SIMD"

Transcription

1 OS SIMD 1 2 SIMD (Single Instruction Multiple Data) SIMD OS (Operating System) SIMD SIMD OS Utilization of a SIMD unit in the OS Kernel Shogo Saito 1 and Shuichi Oikawa 2 Nowadays, it is very common that a processor includes a SIMD (Single Instruction Multiple Data) unit in order to accelerate application processing. While a SIMD unit is a part of a processor, it evolves more rapidly than the integer unit of the processor. Since the use of an FPU (Floating Point Unit) and a SIMD unit is basically abandoned from the kernel, there can be places inside the kernel where a SMID unit works effectively to deal with a large amount of data processing. This paper describes our preliminary work to explore the possibility to utilize a SIMD unit in the kernel. We performed preliminary experiments by using UML (User Mode Linux) and show that data copying can be improved. 1 University of Tsukuba 2 University of Tsukuba 1. SIMD (Single Instruction Multiple Data) SIMD SIMD SIMD SIMD OS OS OS FPU (Floating Point Unit) SIMD OS SIMD OS OS SIMD SIMD SIMD OS OS SIMD UML (User Mode Linux) OS SIMD 2. SIMD SIMD SIMD Intel x86 SSE, SSE2, SSE3, AVX ARM NEON SIMD SIMD SIMD 1 c 2012 Information Processing Society of Japan

2 1 2 SIMD SIMD 4 SIMD SIMD SIMD SIMD SIMD SIMD OS SIMD SIMD OS OS SIMD SIMD OS OS SIMD SIMD SIMD AES SIMD SIMD 3. UML (User Mode Linux) SIMD OS OS SIMD OS SIMD UML (User Mode Linux) OS SIMD UML Linux Linux UML SIMD UML SIMD 3 UML SIMD Linux 2 c 2012 Information Processing Society of Japan

3 int a[256],b[256],c[256] foo () { int i; for( i = 0 ; i < 256 ; i++) { a[i] = b[i] + c[i]; } } 3 4 SIMD UML UML Linux SIMD OS SIMD 4. GNU Compiler Collection (GCC) GCC C SIMD SIMD SIMD C SIMD SIMD GCC ftree-vectorize GCC SIMD GCC SIMD 4 int a[256], b[256], c[256] a[5] = b[5] + c[5] a[6] = b[6] + c[6] SIMD SIMD 5. SIMD OS Linux OS 3 c 2012 Information Processing Society of Japan

4 gprof 5.1 SIMD UML UML Linux GCC IBM ThinkPad X1 CPU Intel Core-i5 2520M 2.50GHz RAM 4GB Intel Core-i5 2520M Intel SIMD SSE3 AVX 5.2 UML UML GCC ftree-vectorize ftree-vectorize-verbose ftree-vectorize GCC ftree-vectorize-verbose 23 GCC OS OS OS 1 Linux source file vectorized loop num /arch/um/drivers/drivers/slip user.c 1 /mm/vmstat.c 1 /fs/ext2/inode.c 2 /fs/ext3/inode.c 2 /fs/ext3/hash.c 2 /fs/isofs/util.c 1 /fs/reiserfs/fix node.c 1 /drivers/base/map.c 1 /net/ipv4/inet hashtables.c 1 /net/ipv4/tcp input.c 1 /net/ipv4/devinet.c 1 /lib/sort.c 1 /lib/bitmap.c 5 /lib/cmdline.c 1 /net/core/dev.c 1 /crypto/algapi.c UML SIMD OS UML GNU Profiler (gprof) Unix Bench 3) SIMD UNIX Bench UML gprof UML UML 2 UNIX Bench memcpy memcpy 4 c 2012 Information Processing Society of Japan

5 2 Linux function consumption time (sec) share (%) memcpy os arch prctl hard handler userspace strncpy SIMD memcpy memcpy SIMD memcpy SIMD SIMD memcpy memcpy SIMD memcpy intel SSE SSE intel x86 SIMD SIMD memcpy movdqu SIMD / movdqu 128 / movdqu 128 / SIMD memcpy memcpy x86 SIMD memcpy memcpy 5,000,000 5 SIMD memcpy memcpy SIMD memcpy memcpy 512 SIMD memcpy SIMD memcpy SIMD x86 5 c 2012 Information Processing Society of Japan

6 4 SIMD memcpy total time memcpy time memcpy rate(%) normal UML s 12.96s 8.72% SIMD memcpy UML s 8.75s 5.68% 33% 8.72% 5.68% UNIXBench OS sec sec 5% 4 UML SIMD UML 8. 6 SIMD memcpy memcpy 3 SIMD memcpy UML function consumption time (sec) rate (%) os arch prctl memcpy hard handler userspace strncpy SIMD memcpy SIMD memcpy UML memcpy SIMD memcpy UNIX BENCH 3 SIMD memcpy UML UML UNIXBench gprof SIMD memcpy 12.96sec 8.75sec 8.1 OS SIMD SIMD memcpy 128 memcpy OS memcpy memcpy memcpy 512 memcpy SIMD OS 6 c 2012 Information Processing Society of Japan

7 8.2 SIMD memcpy memcpy SIMD SIMD SIMD memcpy memcpy SIMD SIMD SIMD memcpy memcpy memcpy memcpy memcpy 128 SIMD SIMD memcpy memcpy SIMD 8.3 SIMD UML OS SIMD SIMD SIMD OS SIMD SIMD OS SIMD SIMD ARM SIMD ARM SIMD NEON NEON SIMD OS SIMD OS SIMD OS SIMD 9. SIMD OS SIMD OS OS OS SIMD UML(User Mode Linux) GCC OS OS OS SIMD OS SIMD intel SIMD AVX 256 SIMD OS 7 c 2012 Information Processing Society of Japan

8 1) Takashi Nakamura, Satoshi Miki, Shuichi Oikawa, Automatic Vectorization by Runtime Binary Translation, In Proceedings of 2011 Second International Conference on Networking and Computing,pp.87-94, ) The User-mode Linux Kernel Home Page 3) byte-unixbench Unix benchmark Suite 4) Intel 64 and IA-32 Architectures Software Developer s Manuals 5) Intel Applications Tuning for Streaming SIMD Extensions PDF/apps simd.pdf 8 c 2012 Information Processing Society of Japan

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization 1 Introduction The purpose of this lab assignment is to give some experience in using SIMD instructions on x86 and getting compiler

More information

SWAR: MMX, SSE, SSE 2 Multiplatform Programming

SWAR: MMX, SSE, SSE 2 Multiplatform Programming SWAR: MMX, SSE, SSE 2 Multiplatform Programming Relatore: dott. Matteo Roffilli roffilli@csr.unibo.it 1 What s SWAR? SWAR = SIMD Within A Register SIMD = Single Instruction Multiple Data MMX,SSE,SSE2,Power3DNow

More information

Toward Building up Arm HPC Ecosystem --Fujitsu s Activities--

Toward Building up Arm HPC Ecosystem --Fujitsu s Activities-- Toward Building up Arm HPC Ecosystem --Fujitsu s Activities-- Shinji Sumimoto, Ph.D. Next Generation Technical Computing Unit FUJITSU LIMITED Jun. 28 th, 2018 0 Copyright 2018 FUJITSU LIMITED Outline of

More information

FFTSS Library Version 3.0 User s Guide

FFTSS Library Version 3.0 User s Guide Last Modified: 31/10/07 FFTSS Library Version 3.0 User s Guide Copyright (C) 2002-2007 The Scalable Software Infrastructure Project, is supported by the Development of Software Infrastructure for Large

More information

Lecture 25: Interrupt Handling and Multi-Data Processing. Spring 2018 Jason Tang

Lecture 25: Interrupt Handling and Multi-Data Processing. Spring 2018 Jason Tang Lecture 25: Interrupt Handling and Multi-Data Processing Spring 2018 Jason Tang 1 Topics Interrupt handling Vector processing Multi-data processing 2 I/O Communication Software needs to know when: I/O

More information

Kernel level AES Acceleration using GPUs

Kernel level AES Acceleration using GPUs Kernel level AES Acceleration using GPUs TABLE OF CONTENTS 1 PROBLEM DEFINITION 1 2 MOTIVATIONS.................................................1 3 OBJECTIVE.....................................................2

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and

More information

Parallel Computing. Prof. Marco Bertini

Parallel Computing. Prof. Marco Bertini Parallel Computing Prof. Marco Bertini Modern CPUs Historical trends in CPU performance From Data processing in exascale class computer systems, C. Moore http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf

More information

Lecture 11 - Portability and Optimizations

Lecture 11 - Portability and Optimizations Lecture 11 - Portability and Optimizations This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

More information

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 SE205 - TD1 Énoncé General Instructions You can download all source files from: https://se205.wp.mines-telecom.fr/td1/ SIMD-like Data-Level Parallelism Modern processors often come with instruction set

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths

Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths Y. Kodama, T. Odajima, M. Matsuda, M. Tsuji, J. Lee and M. Sato RIKEN AICS (Advanced Institute for Computational

More information

last time out-of-order execution and instruction queues the data flow model idea

last time out-of-order execution and instruction queues the data flow model idea 1 last time 2 out-of-order execution and instruction queues the data flow model idea graph of operations linked by depedencies latency bound need to finish longest dependency chain multiple accumulators

More information

Technical Report. Research Lab: LERIA

Technical Report. Research Lab: LERIA Technical Report Improvement of Fitch function for Maximum Parsimony in Phylogenetic Reconstruction with Intel AVX2 assembler instructions Research Lab: LERIA TR20130624-1 Version 1.0 24 June 2013 JEAN-MICHEL

More information

Case Study. Optimizing an Illegal Image Filter System. Software. Intel Integrated Performance Primitives. High-Performance Computing

Case Study. Optimizing an Illegal Image Filter System. Software. Intel Integrated Performance Primitives. High-Performance Computing Case Study Software Optimizing an Illegal Image Filter System Intel Integrated Performance Primitives High-Performance Computing Tencent Doubles the Speed of its Illegal Image Filter System using SIMD

More information

Code Quality Analyzer (CQA)

Code Quality Analyzer (CQA) Code Quality Analyzer (CQA) CQA for Intel 64 architectures Version 1.5b October 2018 www.maqao.org 1 1 Introduction MAQAO-CQA (MAQAO Code Quality Analyzer) is the MAQAO module addressing the code quality

More information

Fundamentals of Computer Design

Fundamentals of Computer Design CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Power Measurements using performance counters

Power Measurements using performance counters Power Measurements using performance counters CSL862: Low-Power Computing By Suman A M (2015SIY7524) Android Power Consumption in Android Power Consumption in Smartphones are powered from batteries which

More information

Contour Detection on Mobile Platforms

Contour Detection on Mobile Platforms Contour Detection on Mobile Platforms Bor-Yiing Su, subrian@eecs.berkeley.edu Prof. Kurt Keutzer, keutzer@eecs.berkeley.edu Parallel Computing Lab, University of California, Berkeley 1/26 Diagnosing Power/Performance

More information

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting

More information

H.J. Lu, Sunil K Pandey. Intel. November, 2018

H.J. Lu, Sunil K Pandey. Intel. November, 2018 H.J. Lu, Sunil K Pandey Intel November, 2018 Issues with Run-time Library on IA Memory, string and math functions in today s glibc are optimized for today s Intel processors: AVX/AVX2/AVX512 FMA It takes

More information

OpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach

OpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach OpenMP: Vectorization and #pragma omp simd Markus Höhnerbach 1 / 26 Where does it come from? c i = a i + b i i a 1 a 2 a 3 a 4 a 5 a 6 a 7 a 8 + b 1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 = c 1 c 2 c 3 c 4 c 5 c

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Programming Parallel Computers

Programming Parallel Computers ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2016 users.ics.aalto.fi/suomela/ppc-2016/ New code must be parallel! otherwise a computer from

More information

Post-K: Building the Arm HPC Ecosystem

Post-K: Building the Arm HPC Ecosystem Post-K: Building the Arm HPC Ecosystem Toshiyuki Shimizu FUJITSU LIMITED Nov. 14th, 2017 Exhibitor Forum, SC17, Nov. 14, 2017 0 Post-K: Building up Arm HPC Ecosystem Fujitsu s approach for HPC Approach

More information

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

Parallel Programming. Easy Cases: Data Parallelism

Parallel Programming. Easy Cases: Data Parallelism Parallel Programming The preferred parallel algorithm is generally different from the preferred sequential algorithm Compilers cannot transform a sequential algorithm into a parallel one with adequate

More information

Porting Linux to x86-64

Porting Linux to x86-64 Porting Linux to x86-64 Andi Kleen SuSE Labs ak@suse.de Abstract... Some implementation details with changes over the existing i386 port are discussed. 1 Introduction x86-64 is a new architecture developed

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

ECE 471 Embedded Systems Lecture 2

ECE 471 Embedded Systems Lecture 2 ECE 471 Embedded Systems Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 7 September 2018 Announcements Reminder: The class notes are posted to the website. HW#1 will

More information

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH Key words: Digital Signal Processing, FIR filters, SIMD processors, AltiVec. Grzegorz KRASZEWSKI Białystok Technical University Department of Electrical Engineering Wiejska

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

Performance of Host Identity Protocol on Lightweight Hardware

Performance of Host Identity Protocol on Lightweight Hardware Performance of Host Identity Protocol on Lightweight Hardware Andrey Khurri, Ekaterina Vorobyeva, Andrei Gurtov Helsinki Institute for Information Technology MobiArch'07 Kyoto,

More information

The Mont-Blanc Project

The Mont-Blanc Project http://www.montblanc-project.eu The Mont-Blanc Project Daniele Tafani Leibniz Supercomputing Centre 1 Ter@tec Forum 26 th June 2013 This project and the research leading to these results has received funding

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

Dan Stafford, Justine Bonnot

Dan Stafford, Justine Bonnot Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing

More information

OpenCL Vectorising Features. Andreas Beckmann

OpenCL Vectorising Features. Andreas Beckmann Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels

More information

Review. Topics. Lecture 3. Advanced Programming Topics. Review: How executable files are generated. Defining macros through compilation flags

Review. Topics. Lecture 3. Advanced Programming Topics. Review: How executable files are generated. Defining macros through compilation flags Review Dynamic memory allocation Look a-like dynamic 2D array Simulated 2D array How cache memory / cache line works Lecture 3 Command line arguments Pre-processor directives #define #ifdef #else #endif

More information

A Simple Path to Parallelism with Intel Cilk Plus

A Simple Path to Parallelism with Intel Cilk Plus Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify making taking advantage of vectorization and threading parallelism in your code. It provides a brief description

More information

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed

More information

High Performance Matrix-matrix Multiplication of Very Small Matrices

High Performance Matrix-matrix Multiplication of Very Small Matrices High Performance Matrix-matrix Multiplication of Very Small Matrices Ian Masliah, Marc Baboulin, ICL people University Paris-Sud - LRI Sparse Days Cerfacs, Toulouse, 1/07/2016 Context Tensor Contractions

More information

Portable Power/Performance Benchmarking and Analysis with WattProf

Portable Power/Performance Benchmarking and Analysis with WattProf Portable Power/Performance Benchmarking and Analysis with WattProf Amir Farzad, Boyana Norris University of Oregon Mohammad Rashti RNET Technologies, Inc. Motivation Energy efficiency is becoming increasingly

More information

U23 - Binary Exploitation

U23 - Binary Exploitation U23 - Binary Exploitation Stratum Auhuur robbje@aachen.ccc.de November 21, 2016 Context OS: Linux Context OS: Linux CPU: x86 (32 bit) Context OS: Linux CPU: x86 (32 bit) Address Space Layout Randomization:

More information

Lecture 2. Systems Programming with the Raspberry Pi

Lecture 2. Systems Programming with the Raspberry Pi F28HS Hardware-Software Interface: Systems Programming Hans-Wolfgang Loidl School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh Semester 2 2015/16 0 No proprietary software has

More information

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One CO200-Computer Organization and Architecture - Assignment One Note: A team may contain not more than 2 members. Format the assignment solutions in a L A TEX document. E-mail the assignment solutions PDF

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County Accelerating a climate physics model with OpenCL Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou University of Maryland Baltimore County Introduction The demand to increase forecast predictability

More information

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016 SIMD Instructions outside and inside Oracle 2c Laurent Léturgez 206 Whoami Oracle Consultant since 200 Former developer (C, Java, perl, PL/SQL) Owner@Premiseo: Data Management on Premise and in the Cloud

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

Pa-risc 1.1 Architecture And Instruction Set >>>CLICK HERE<<<

Pa-risc 1.1 Architecture And Instruction Set >>>CLICK HERE<<< Pa-risc 1.1 Architecture And Instruction Set Reference Manual 1 Rationale. 1.1 Motivation IBM 801 2), PA-Risc 4) and Monads 19), consist of one entry per physical. page frame PA-RISC 1.1 Architecture and

More information

Turbostream: A CFD solver for manycore

Turbostream: A CFD solver for manycore Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware

More information

Lecture 03 Bits, Bytes and Data Types

Lecture 03 Bits, Bytes and Data Types Lecture 03 Bits, Bytes and Data Types Computer Languages A computer language is a language that is used to communicate with a machine. Like all languages, computer languages have syntax (form) and semantics

More information

Breaking the Boundaries in Heterogeneous-ISA Datacenters

Breaking the Boundaries in Heterogeneous-ISA Datacenters Breaking the Boundaries in Heterogeneous-ISA Datacenters Antonio Barbalace, Robert Lyerly, Christopher Jelesnianski, Anthony Carno, Ho-Ren Chuang, Vincent Legout and Binoy Ravindran Systems Software Research

More information

Systems Design and Programming. Instructor: Chintan Patel

Systems Design and Programming. Instructor: Chintan Patel Systems Design and Programming Instructor: Chintan Patel Text: Barry B. Brey, 'The Intel Microprocessors, 8086/8088, 80186/80188, 80286, 80386, 80486, Pentium and Pentium Pro Processor, Pentium II, Pentium

More information

A Case Study in Optimizing GNU Radio s ATSC Flowgraph

A Case Study in Optimizing GNU Radio s ATSC Flowgraph A Case Study in Optimizing GNU Radio s ATSC Flowgraph Presented by Greg Scallon and Kirby Cartwright GNU Radio Conference 2017 Thursday, September 14 th 10am ATSC FLOWGRAPH LOADING 3% 99% 76% 36% 10% 33%

More information

Advanced OpenMP Features

Advanced OpenMP Features Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =

More information

Freescale Semiconductor gcc linaro toolchain, Rev

Freescale Semiconductor gcc linaro toolchain, Rev ABOUT GCC LINARO 4.6.2 MULTILIB TOOLCHAIN 1 What s new... 2 2 What s inside... 2 3 How to use... 3 3.1 gcc... 3 3.2 Application debug tools... 5 4 Appendix... 6 4.1 Toolchain test result... 6 4.1.1 Test

More information

EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA

EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA Mark Harris, NVIDIA @harrism #NVSC14 EXTENDING THE REACH OF CUDA 1 Machine Learning 2 Higher Performance 3 New Platforms 4 New Languages 2 GPUS: THE

More information

Exercise Session 6. Data Processing on Modern Hardware L Fall Semester Cagri Balkesen

Exercise Session 6. Data Processing on Modern Hardware L Fall Semester Cagri Balkesen Cagri Balkesen Data Processing on Modern Hardware Exercises Fall 2012 1 Exercise Session 6 Data Processing on Modern Hardware 263-3502-00L Fall Semester 2012 Cagri Balkesen cagri.balkesen@inf.ethz.ch Department

More information

Lectures Parallelism

Lectures Parallelism Lectures 24-25 Parallelism 1 Pipelining vs. Parallel processing In both cases, multiple things processed by multiple functional units Pipelining: each thing is broken into a sequence of pieces, where each

More information

SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input

More information

Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes

Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes 23 October 2014 Table of Contents 1 Introduction... 1 1.1 Product Contents... 2 1.2 Intel Debugger (IDB) is

More information

Lecture 12: Instruction Execution and Pipelining. William Gropp

Lecture 12: Instruction Execution and Pipelining. William Gropp Lecture 12: Instruction Execution and Pipelining William Gropp www.cs.illinois.edu/~wgropp Yet More To Consider in Understanding Performance We have implicitly assumed that an operation takes one clock

More information

EXTENDING THE GFN PRIME SEARCH BEYOND 1M DIGITS USING GPUS

EXTENDING THE GFN PRIME SEARCH BEYOND 1M DIGITS USING GPUS EXTENDING THE GFN PRIME SEARCH BEYOND 1M DIGITS USING GPUS PPAM 2013, Warsaw Iain Bethune and Michael Goetz Outline PrimeGrid Genefer: Background Genefer: New developments GFN prime search status Future

More information

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Piecewise Holistic Autotuning of Compiler and Runtime Parameters Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R

More information

F28HS Hardware-Software Interface: Systems Programming

F28HS Hardware-Software Interface: Systems Programming F28HS Hardware-Software Interface: Systems Programming Hans-Wolfgang Loidl School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh Semester 2 2017/18 0 No proprietary software has

More information

Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory

Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory Soramichi Akiyama Department of Creative Informatics Graduate School of Information Science and Technology The University

More information

Modeling CPU Energy Consumption for Energy Efficient Scheduling

Modeling CPU Energy Consumption for Energy Efficient Scheduling Modeling CPU Energy Consumption for Energy Efficient Scheduling Abhishek Jaiantilal, Yifei Jiang, Shivakant Mishra University of Colorado - Boulder GCM '10 Proceedings of the 1st Workshop on Green Computing

More information

Byte Ordering. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Byte Ordering. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Byte Ordering Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong (jinkyu@skku.edu)

More information

Introduction to the Tegra SoC Family and the ARM Architecture. Kristoffer Robin Stokke, PhD FLIR UAS

Introduction to the Tegra SoC Family and the ARM Architecture. Kristoffer Robin Stokke, PhD FLIR UAS Introduction to the Tegra SoC Family and the ARM Architecture Kristoffer Robin Stokke, PhD FLIR UAS Goals of Lecture To give you something concrete to start on Simple introduction to ARMv8 NEON programming

More information

Chapter 5. Introduction ARM Cortex series

Chapter 5. Introduction ARM Cortex series Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1

More information

SimBench. A Portable Benchmarking Methodology for Full-System Simulators. Harry Wagstaff Bruno Bodin Tom Spink Björn Franke

SimBench. A Portable Benchmarking Methodology for Full-System Simulators. Harry Wagstaff Bruno Bodin Tom Spink Björn Franke SimBench A Portable Benchmarking Methodology for Full-System Simulators Harry Wagstaff Bruno Bodin Tom Spink Björn Franke Institute for Computing Systems Architecture University of Edinburgh ISPASS 2017

More information

An introduction to today s Modular Operating System

An introduction to today s Modular Operating System An introduction to today s Modular Operating System Bun K. Tan Open Source Technology Center - Intel Corporation October 2018 *Other names and brands may be claimed as the property of others Agenda Why

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform

More information

Preparing for Highly Parallel, Heterogeneous Coprocessing

Preparing for Highly Parallel, Heterogeneous Coprocessing Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?

More information

Performance comparison and optimization: Case studies using BenchIT

Performance comparison and optimization: Case studies using BenchIT John von Neumann Institute for Computing Performance comparison and optimization: Case studies using BenchIT R. Schöne, G. Juckeland, W.E. Nagel, S. Pflüger, R. Wloch published in Parallel Computing: Current

More information

A C compiler for Large Data Sequential Processing using Remote Memory

A C compiler for Large Data Sequential Processing using Remote Memory A C compiler for Large Data Sequential Processing using Remote Memory Shiyo Yoshimura, Hiroko Midorikawa Graduate School of Science and Technology, Seikei University, Tokyo, Japan E-mail:dm106231@cc.seikei.ac.jp,

More information

ArcGIS Engine Developer Kit System Requirements

ArcGIS Engine Developer Kit System Requirements ArcGIS Engine Developer Kit 9.0.1 System Requirements This PDF contains system requirements information, including hardware requirements, best performance configurations, and limitations, for ArcGIS Engine

More information

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit Slides prepared by Kip R. Irvine Revision date: 09/25/2002

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

Lecture 4. Performance: memory system, programming, measurement and metrics Vectorization

Lecture 4. Performance: memory system, programming, measurement and metrics Vectorization Lecture 4 Performance: memory system, programming, measurement and metrics Vectorization Announcements Dirac accounts www-ucsd.edu/classes/fa12/cse260-b/dirac.html Mac Mini Lab (with GPU) Available for

More information

Programming Parallel Computers

Programming Parallel Computers ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2015 users.ics.aalto.fi/suomela/ppc-2015/ Introduction Modern computers have high-performance

More information

Linux ftrace, , Android Systrace. Android [2][3]. Linux ftrace. Linux. Intel VTune[6] perf timechart[7]. ,, GPU Intel. .

Linux ftrace, , Android Systrace. Android [2][3]. Linux ftrace. Linux. Intel VTune[6] perf timechart[7]. ,, GPU Intel. . Linux ftrace 1 1 1 Dominic Hillenbrand 1 1 1,.,,.,., Linux ftrace., Intel Xeon X7560, ARMv7 equake, art, mpeg2enc OS., 1 Intel Xeon 1.07[us], ARM 4.44[us]., Linux, ftrace, 1...,.,,,., [1].,.,. 1 Waseda

More information

Super Matrix Solver-P-ICCG:

Super Matrix Solver-P-ICCG: Super Matrix Solver-P-ICCG: February 2011 VINAS Co., Ltd. Project Development Dept. URL: http://www.vinas.com All trademarks and trade names in this document are properties of their respective owners.

More information

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions

More information

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of

More information

What Transitioning from 32-bit to 64-bit x86 Computing Means Today

What Transitioning from 32-bit to 64-bit x86 Computing Means Today What Transitioning from 32-bit to 64-bit x86 Computing Means Today Chris Wanner Senior Architect, Industry Standard Servers Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information

More information

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming Systems Design and Programming Instructor: Professor Jim Plusquellic Text: Barry B. Brey, The Intel Microprocessors, 8086/8088, 80186/80188, 80286, 80386, 80486, Pentium and Pentium Pro Processor Architecture,

More information

Things to know about Numeric Computation

Things to know about Numeric Computation Things to know about Numeric Computation Classes of Numbers Countable Sets of Numbers: N: Natural Numbers {1, 2, 3, 4...}. Z: Integers (contains N) {..., -3, -2, -1, 0, 1, 2, 3,...} Q: Rational Numbers

More information

Installation Guide and Release Notes

Installation Guide and Release Notes Intel Parallel Studio XE 2013 for Linux* Installation Guide and Release Notes Document number: 323804-003US 10 March 2013 Table of Contents 1 Introduction... 1 1.1 What s New... 1 1.1.1 Changes since Intel

More information

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)

More information

SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision

SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision Toshiaki Hishinuma 1, Hidehiko Hasegawa 12, and Teruo Tanaka 2 1 University of Tsukuba, Tsukuba, Japan 2 Kogakuin

More information

Splotch: High Performance Visualization using MPI, OpenMP and CUDA

Splotch: High Performance Visualization using MPI, OpenMP and CUDA Splotch: High Performance Visualization using MPI, OpenMP and CUDA Klaus Dolag (Munich University Observatory) Martin Reinecke (MPA, Garching) Claudio Gheller (CSCS, Switzerland), Marzia Rivi (CINECA,

More information

Overhead Evaluation about Kprobes and Djprobe (Direct Jump Probe)

Overhead Evaluation about Kprobes and Djprobe (Direct Jump Probe) Overhead Evaluation about Kprobes and Djprobe (Direct Jump Probe) Masami Hiramatsu Hitachi, Ltd., SDL Jul. 13. 25 1. Abstract To implement flight recorder system, the overhead

More information

Progress Report on QDP-JIT

Progress Report on QDP-JIT Progress Report on QDP-JIT F. T. Winter Thomas Jefferson National Accelerator Facility USQCD Software Meeting 14 April 16-17, 14 at Jefferson Lab F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 /

More information

A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory

A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory Maximilian Seitzer, Michael Gruhn, Tilo Müller Friedrich Alexander Universität Erlangen-Nürnberg https://www1.cs.fau.de Introduction

More information