Energy Efficient Transparent Library Acceleration with CAPI. Heiner Giefers, IBM Research Zurich

1 Energy Efficient Transparent Library Acceleration with CAPI
Heiner Giefers, IBM Research Zurich
Revolutionizing the Datacenter. Join the Conversation #OpenPOWERSummit

2 Towards highly efficient data centers
[Figure: roadmap. Stages: PUE optimization and virtualization; energy-efficient architectures; next-generation devices. Techniques: workload consolidation, efficient cooling, heterogeneous computing, near-memory computing, in-memory computing, beyond CMOS. Gains: today, 5-50x, >100x.]
- PUE is not a measure of efficiency
- Heterogeneous computing improves energy efficiency
- Programming heterogeneous systems is (still) challenging
4/11/16

3 Enabling FPGAs for software programmers
Enable hardware accelerators for a larger community
- FPGA development is more complex than software development
  - High-level design tools (e.g. SDAccel for OpenCL)
- Library acceleration
  - Drop-in replacement for standard software library
[Chart categories: Web, Desktop, Embedded, Hardware, Mobile. Source: stackoverflow.com/research/developer-survey]

4 Cross-platform standard software libraries

5 Example: Fast Fourier Transform
[Figure: the DFT maps amplitude over time to amplitude over frequency]
- FFTs are widely used
  - DSP: spectral analysis, filter banks
  - Data compression: MP3, JPEG
  - ML: convolutional neural networks
  - HPC: partial differential equations, mathematical finance
- Common FFT libraries (FFTW, ESSL, MKL, ...)

6 Planning FFTs in software
[Figure labels: N/2-point DFT, expand, plan, N/2-point DFT]
- An FFT library consists of many small FFT kernels (codelets)
- On-line optimization: a planner picks the best composition (plan) by measuring the speed of different combinations
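The measure-and-pick idea on slide 6 can be sketched in a few lines of C. This is only an illustration, not FFTW's planner: `pick_plan`, `measure_fn`, and `lookup_time` are hypothetical names, and a candidate "plan" is reduced to an index whose runtime is reported by a caller-supplied callback.

```c
#include <stddef.h>

/* Hypothetical callback: returns the measured runtime of one candidate
   composition of codelets. How the measurement is taken is up to the caller. */
typedef double (*measure_fn)(size_t candidate, void *ctx);

/* Measure every candidate once and return the index of the fastest one.
   Returns (size_t)-1 when there are no candidates. */
size_t pick_plan(size_t n_candidates, measure_fn measure, void *ctx)
{
    size_t best = (size_t)-1;
    double best_time = 0.0;
    for (size_t i = 0; i < n_candidates; ++i) {
        double t = measure(i, ctx);
        if (best == (size_t)-1 || t < best_time) {
            best = i;
            best_time = t;
        }
    }
    return best;
}

/* Illustrative measurement backed by a table of pre-recorded timings,
   so the selection logic can be exercised without running real kernels. */
double lookup_time(size_t candidate, void *ctx)
{
    const double *timings = ctx;
    return timings[candidate];
}
```

In a real planner the callback would execute the candidate on scratch buffers and time it. Keeping the measurement behind a function pointer simply separates the selection logic from the timing method.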

7 Deep pipelining for hardware FFTs
[Figure labels: N/2-point DFT; compute, shuffle, expand, fold; N/2-point DFT]
- Reconfigure the FPGA with a deep FFT pipeline
- Fully streamed; linear memory access pattern
- Butterfly compute units
- Shuffle units
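The two unit types named on slide 7, butterfly compute and shuffle, also appear in a plain software radix-2 FFT. The sketch below is a software model under stated assumptions, not the FPGA design: `cpx`, `bit_reverse_shuffle`, and `fft_radix2` are hypothetical names, `n` must be a power of two, and the caller supplies the twiddle table `w[k] = exp(-2*pi*i*k/n)` for `k < n/2`.

```c
#include <stddef.h>

typedef struct { float re, im; } cpx;

static cpx cpx_mul(cpx a, cpx b)
{
    cpx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Shuffle step: permute x[0..n-1] into bit-reversed index order. */
void bit_reverse_shuffle(cpx *x, size_t n)
{
    for (size_t i = 1, j = 0; i < n; ++i) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) { cpx t = x[i]; x[i] = x[j]; x[j] = t; }
    }
}

/* In-place decimation-in-time FFT. Each value of len is one stage;
   a hardware pipeline would instantiate the stages back to back. */
void fft_radix2(cpx *x, size_t n, const cpx *w)
{
    bit_reverse_shuffle(x, n);
    for (size_t len = 2; len <= n; len <<= 1) {
        size_t half = len / 2, stride = n / len;
        for (size_t start = 0; start < n; start += len) {
            for (size_t k = 0; k < half; ++k) {
                /* Butterfly: combine two inputs with one twiddle factor. */
                cpx a = x[start + k];
                cpx b = cpx_mul(x[start + k + half], w[k * stride]);
                x[start + k].re = a.re + b.re;
                x[start + k].im = a.im + b.im;
                x[start + k + half].re = a.re - b.re;
                x[start + k + half].im = a.im - b.im;
            }
        }
    }
}
```

For a 4-point transform the twiddle table is `{1, -i}`, and the input `{1, 2, 3, 4}` yields `{10, -2+2i, -2, -2-2i}`. Passing the twiddles in keeps the sketch free of math-library calls.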

8 Heterogeneous compute libraries
[Figure: software stack on a POWER system]
- User application: GNU Radio (dynamically linking fftw)
- FFTW interposer library: select optimal platform here, using a model
- Backends: FFTW library on the POWER8 CPU; custom FFT API on a PCIe FPGA (user mode driver, device driver); custom FFT API on a CAPI FPGA (libcxl)
- Train the mapping strategy using sensors (performance, power)
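The slide says only that the interposer selects a platform and that the mapping is trained from performance and power sensors. One way this could look is sketched below. The selection criterion (lowest estimated energy per call, runtime times power) is an assumption made for illustration, and `backend_model`, `model_update`, and `select_backend` are hypothetical names.

```c
#include <stddef.h>

/* Per-backend running averages built from sensor readings (hypothetical model). */
typedef struct {
    const char *name;
    double avg_runtime_us;
    double avg_power_w;
    unsigned samples;
} backend_model;

/* Training step: fold one observed call into the running means. */
void model_update(backend_model *m, double runtime_us, double power_w)
{
    m->samples += 1;
    m->avg_runtime_us += (runtime_us - m->avg_runtime_us) / m->samples;
    m->avg_power_w += (power_w - m->avg_power_w) / m->samples;
}

/* Selection step: index of the backend with the lowest estimated energy
   per call. Assumes n >= 1. */
size_t select_backend(const backend_model *models, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        double e_i = models[i].avg_runtime_us * models[i].avg_power_w;
        double e_best = models[best].avg_runtime_us * models[best].avg_power_w;
        if (e_i < e_best)
            best = i;
    }
    return best;
}
```

A latency-first policy would compare `avg_runtime_us` alone. The objective is a design choice that the slide does not specify.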


10 FFTW library interposing

    fftwf_complex *in0, *out0, *in1, *out1;
    fftwf_plan p;
    // allocate and initialize...
    p = fftwf_plan_dft_1d(n, in0, out0, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(p);
    fftwf_execute_dft(p, in1, out1); // reuse plan on new arrays

[Flowchart, lanes: User application, FFTW interposer library, POWER8 CPU, FPGA]
- fftw_plan: is the plan supported? N: return. Y: register the plan, then return.
- fftw_execute: is the plan registered? N: execute using software FFTW, then return. Y: assemble the WED and do an MMIO write to the AFU register; the FPGA executes the FFT and signals completion; the library is notified and returns.
- Legend: control flow for batched version
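The plan/execute decision logic in the flowchart can be modelled without any FFTW or CAPI dependency. The sketch below is hypothetical throughout: `plan_registry`, `on_plan_created`, and `on_execute` are illustrative names, and "supported" is reduced to a single transform size, which the slides do not state. The slides also do not say how the calls are hooked. On Linux, one common approach is a preloaded shared library that forwards to the original functions via `dlsym(RTLD_NEXT, ...)`.

```c
#include <stddef.h>

enum backend { BACKEND_SOFTWARE, BACKEND_FPGA };

#define MAX_PLANS 16

typedef struct {
    const void *plans[MAX_PLANS];  /* opaque handles of registered plans */
    size_t count;
    size_t supported_size;         /* assumed: the one size the AFU handles */
} plan_registry;

/* fftw_plan path: register the plan only if the accelerator supports it.
   Returns 1 if the plan was registered, 0 otherwise. */
int on_plan_created(plan_registry *r, const void *plan, size_t n)
{
    if (n != r->supported_size || r->count == MAX_PLANS)
        return 0;
    r->plans[r->count++] = plan;
    return 1;
}

/* fftw_execute path: registered plans go to the FPGA, all others
   fall back to the software FFTW implementation. */
enum backend on_execute(const plan_registry *r, const void *plan)
{
    for (size_t i = 0; i < r->count; ++i)
        if (r->plans[i] == plan)
            return BACKEND_FPGA;
    return BACKEND_SOFTWARE;
}
```

In a real interposer the `BACKEND_FPGA` branch would build the work element descriptor and start the AFU. The fallback branch keeps the library a drop-in replacement for unsupported plans.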

11 Latency for a single FFT function call
- Latency for a single CAPI FFT call is 10% higher than on the CPU (can be improved, as the AFU is bandwidth optimized)
- 4x better compared to a PCIe version using OpenCL
[Chart: runtime in microseconds for one 4k-input complex FFT from cache, split into compute and copy. CPU: 80; FPGA using CAPI: 89; FPGA using PCIe (OpenCL)]

12 FFT execution time on P8 and accelerators
[Chart: runtime per FFT [us] versus number of FFTs. Series: P8 (1 core, FFTW); CAPI (2 samples/cycle, non-batched, lock); CAPI (1 sample/cycle, batched, irq); CAPI (2 samples/cycle, batched, irq)]

13 Power trace for multi-threaded FFT
[Plot traces: Total Power; I/O Power; Socket 0 Power (CPU, Memory); Socket 1 Power (CPU, Memory)]
- CAPI FFT processing power on the FPGA card: ~3 W

14 Energy efficiency
Test case: compute 100 rounds of subsequent 4k-point FFTs in complex single-precision float (1 GB input samples per round)
a) 1 core: 0.21 GFLOP/W
b) 12 cores [1]: 0.31 GFLOP/W
c) 12 cores [2]: 0.12 GFLOP/W
d) 1 AFU: 3.37 GFLOP/W
Result: one AFU is 2.2x faster and 16x more energy efficient compared to one core
[1] 12 threads, SMT1, DVFS off
[2] 96 threads, SMT8, DVFS on
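The arithmetic behind such a test case can be checked with a short C sketch. The slide does not say how operations were counted. The code assumes the conventional estimate of 5 N log2 N floating-point operations per complex N-point FFT, 8 bytes per single-precision complex sample, and 1 GB = 2^30 bytes. `fft_flops`, `ffts_per_round`, and `efficiency_ratio` are hypothetical helper names.

```c
#include <stdint.h>

/* Conventional operation count for a complex n-point FFT,
   n a power of two: 5 * n * log2(n). */
uint64_t fft_flops(uint64_t n)
{
    uint64_t log2n = 0;
    for (uint64_t m = n; m > 1; m >>= 1)
        ++log2n;
    return 5 * n * log2n;
}

/* Number of n-point FFTs that fit in one round of input,
   at 8 bytes per single-precision complex sample. */
uint64_t ffts_per_round(uint64_t input_bytes, uint64_t n)
{
    return input_bytes / (n * 8);
}

/* Ratio of two energy-efficiency figures given in GFLOP/W. */
double efficiency_ratio(double gflop_per_w_a, double gflop_per_w_b)
{
    return gflop_per_w_a / gflop_per_w_b;
}
```

Under these assumptions a 4k-point FFT costs 5 * 4096 * 12 = 245760 operations, and one round contains 32768 transforms. The 16x figure on the slide is consistent with the listed efficiencies, since 3.37 / 0.21 is approximately 16.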

15 Conclusion and work in progress
- CAPI enables
  - Offloading of lightweight compute jobs
  - Transparent integration of FPGA accelerators via shared virtual memory
  - Energy-efficient computing for enterprise-class servers
- FPGA accelerators for cognitive computing
  - Regular expression matching
  - Sparse linear algebra for graph analytics
  - General convolution kernels for machine learning

16 Power trace for multi-threaded FFT


More information

Fault Tolerant Runtime ANL. Wesley Bland Joint Lab for Petascale Compu9ng Workshop November 26, 2013

Fault Tolerant Runtime ANL. Wesley Bland Joint Lab for Petascale Compu9ng Workshop November 26, 2013 Fault Tolerant Runtime Research @ ANL Wesley Bland Joint Lab for Petascale Compu9ng Workshop November 26, 2013 Brief History of FT Checkpoint/Restart (C/R) has been around for quite a while Guards against

More information

A 101 Guide to Heterogeneous, Accelerated, Data Centric Computing Architectures

A 101 Guide to Heterogeneous, Accelerated, Data Centric Computing Architectures A 101 Guide to Heterogeneous, Accelerated, Centric Computing Architectures Allan Cantle President & Founder, Nallatech Join the Conversation #OpenPOWERSummit 2016 OpenPOWER Foundation Buzzword & Acronym

More information

ECSE 425 Lecture 25: Mul1- threading

ECSE 425 Lecture 25: Mul1- threading ECSE 425 Lecture 25: Mul1- threading H&P Chapter 3 Last Time Theore1cal and prac1cal limits of ILP Instruc1on window Branch predic1on Register renaming 2 Today Mul1- threading Chapter 3.5 Summary of ILP:

More information

7 Ways to Increase Your Produc2vity with Revolu2on R Enterprise 3.0. David Smith, REvolu2on Compu2ng

7 Ways to Increase Your Produc2vity with Revolu2on R Enterprise 3.0. David Smith, REvolu2on Compu2ng 7 Ways to Increase Your Produc2vity with Revolu2on R Enterprise 3.0 David Smith, REvolu2on Compu2ng REvolu2on Compu2ng: The R Company REvolu2on R Free, high- performance binary distribu2on of R REvolu2on

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

FPGAs as Streaming MIMD Machines for Data Analy9cs. James Thomas, Matei Zaharia, Pat Hanrahan

FPGAs as Streaming MIMD Machines for Data Analy9cs. James Thomas, Matei Zaharia, Pat Hanrahan FPGAs as Streaming MIMD Machines for Data Analy9cs James Thomas, Matei Zaharia, Pat Hanrahan CPU/GPU Control Flow Divergence For peak performance, CPUs and GPUs require groups of threads to have iden9cal

More information

Profiling & Tuning Applica1ons. CUDA Course July István Reguly

Profiling & Tuning Applica1ons. CUDA Course July István Reguly Profiling & Tuning Applica1ons CUDA Course July 21-25 István Reguly Introduc1on Why is my applica1on running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA,

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster Jadin C. Jackson, PhD Biology University of St. Thomas jadincjackson@stthomas.edu Bradley S. Rubin, PhD Graduate Programs in Software

More information

SPIRAL, FFTX, and the Path to SpectralPACK

SPIRAL, FFTX, and the Path to SpectralPACK SPIRAL, FFTX, and the Path to SpectralPACK Franz Franchetti Carnegie Mellon University www.spiral.net In collaboration with the SPIRAL and FFTX team @ CMU and LBL This work was supported by DOE ECP and

More information

Western Michigan University

Western Michigan University CS-6030 Cloud compu;ng Google App engine Sepideh Mohammadi Summer II 2017 Western Michigan University content Categories of cloud compu;ng Google cloud plaborm Google App Engine Storage technologies Datastore

More information

Unlocking FPGAs Using High- Level Synthesis Compiler Technologies

Unlocking FPGAs Using High- Level Synthesis Compiler Technologies Unlocking FPGAs Using High- Leel Synthesis Compiler Technologies Fernando Mar*nez Vallina, Henry Styles Xilinx Feb 22, 2015 Why are FPGAs Good Scalable, highly parallel and customizable compute 10s to

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Embedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.

Embedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto. Embedded processors Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.fi Comparing processors Evaluating processors Taxonomy of processors

More information

Accelerating computation with FPGAs

Accelerating computation with FPGAs Accelerating computation with FPGAs Michael J. Flynn Maxeler Technologies and Stanford University M. J. Flynn Maxeler Technologies 1 Based on work done by my colleagues at Maxeler, especially Oskar Mencer,

More information

PhD in Computer And Control Engineering XXVII cycle. Torino February 27th, 2015.

PhD in Computer And Control Engineering XXVII cycle. Torino February 27th, 2015. PhD in Computer And Control Engineering XXVII cycle Torino February 27th, 2015. Parallel and reconfigurable systems are more and more used in a wide number of applica7ons and environments, ranging from

More information