OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch
|
|
- Octavia Richardson
- 5 years ago
- Views:
Transcription
1 OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch
2 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 2
3 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 32
4 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) Programming of heterogeneous systems is non-trivial OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 42
5 Motivation Increasing use of heterogeneous systems to overcome CPU power limitations FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) Programming of heterogeneous systems is non-trivial Desirable: Programming with a single, portable code base OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 52
6 OpenMP Device Offloading #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ Target Region y[i] = a*x[i]+y[i]; } } Denote target regions to execute on device OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 63
7 OpenMP Device Offloading #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ Target Region y[i] = a*x[i]+y[i]; } } Denote target regions to execute on device Specify which and how data is transferred to device memory OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 73
8 OpenMP Device Offloading #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ y[i] = a*x[i]+y[i]; } } Denote target regions to execute on device Specify which and how data is transferred to device memory Use additional parallel constructs inside target region (also targetspecific, e.g. teams, distribute,...) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 83
9 Goal Implement mapping of target regions to FPGA accelerators in LLVM Clang Preserve FPGA-specific pragmas (e.g. Vivado HLS) Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 94
10 Goal Implement mapping of target regions to FPGA accelerators in LLVM Clang Preserve FPGA-specific pragmas (e.g. Vivado HLS) Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable Extend LLVM OpenMP Runtime Manage data-transfers between host and FPGA Control device execution on the FPGA accelerator OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 104
11 ThreadPoolComposer Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems Synthesize accelerator from kernel code TPC is available as open source: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 115
12 ThreadPoolComposer Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems Assemble (multiple) instances of different kernels in top-level design, combined with standardized host- and memory connection TPC is available as open source: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 125
13 ThreadPoolComposer Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems Control execution and datatransfer using two-layered API Higher-level TPC API is device/platform-agnostic, allows for portable implementation (write once, run everywhere) TPC is available as open source: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 135
14 Compilation Flow Start from a single, portable source file OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 146
15 Compilation Flow Start from a single, portable source file Standard hostcompilation, including fallback if offloading fails OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 156
16 Compilation Flow Start from a single, portable source file Standard hostcompilation, including fallback if offloading fails One device-specific compilation flow per device type Limited to extracted target regions OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 166
17 Compilation Flow Custom Clang toolchain for TPC-based offloading to FPGA accelerators Identified with new LLVM target triple Preserves FPGAspecific pragmas, e.g. Vivado HLS pragmas Yields three artifacts OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 176
18 Compilation Flow TPC-specific software binary Entry point for FPGA device execution Transfers kernel arguments and launches hardware execution using TPC API OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 186
19 Compilation Flow TPC-specific software binary Entry point for FPGA device execution Transfers kernel arguments and launches hardware execution using TPC API Included in the combined binary OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 196
20 Compilation Flow Hardware kernel code extracted from target region OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 206
21 Compilation Flow Hardware kernel code extracted from target region Description of input argument types OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 216
22 Compilation Flow Hardware kernel code extracted from target region Description of input argument types TPC automates synthesis from kernel code and description to full FPGA design No additional user input required! OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 226
23 Compilation Flow Hardware kernel code extracted from target region Description of input argument types TPC automates synthesis from kernel code and description to full FPGA design Resulting bitstream features standardized host- and memory connection OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 236
24 Host Runtime Flow Components: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 247
25 Host Runtime Flow Components: LLVM OpenMP Runtime Infrastructure OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 257
26 Runtime Flow Host Components: LLVM OpenMP Runtime Infrastructure TPC-based plugin for LLVM OpenMP Runtime OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 267
27 Runtime Flow Host Components: LLVM OpenMP Runtime Infrastructure TPC-based plugin for LLVM OpenMP Runtime TPC-specific software binary resulting from compilation OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 277
28 Runtime Flow Host Components: LLVM OpenMP Runtime Infrastructure TPC-based plugin for LLVM OpenMP Runtime TPC-specific software binary resulting from compilation FPGA abstraction as provided by TPC OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 287
29 Host Runtime Flow Host-centric: Execution starts on the host If target region is encountered: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 297
30 Runtime Flow Host Host-centric: Execution starts on the host If target region is encountered: Transfer data to FPGA memory OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 307
31 Runtime Flow Host Host-centric: Execution starts on the host If target region is encountered: Transfer data to FPGA memory Invoke binary Sets kernel arguments Launches hardware execution OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 317
32 Runtime Flow Host Host-centric: Execution starts on the host If target region is encountered: Transfer data to FPGA memory Invoke binary Sets kernel arguments Launches hardware execution Transfer data back to host OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 327
33 Evaluation Proof-of-concept implementation based on development version of LLVM/Clang/LLVM OpenMP Runtime Evaluation using integer BLAS kernels from Adept benchmark suite AXPY Vector scaling Vector dot product Dense matrix vector multiplication Euclidean norm for vector OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 338
34 Evaluation Using Xilinx Vivado HLS Kernels annotated with Vivado HLS pipeline pragma 250 MHz kernel operation frequency #pragma omp target \ map(to:x[0:size]) \ map(tofrom:y[0:size]) { #pragma omp parallel for[...] for(i=0; i<size; i++){ #pragma HLS PIPELINE II=1 y[i] = a*x[i]+y[i]; } } OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 349
35 Evaluation FPGA-board: VC709 (Virtex 7), 4 GiB on-board memory Single instance of each kernel ( single-core ) Comparison against x86-cpu (i7-6700k, 16 GiB DDR4- RAM) running 4 OpenMP threads OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 35 10
36 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 36 11
37 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API Program a FPGA-based heterogeneous system with a single, portable code-base OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 37 11
38 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API Program a FPGA-based heterogeneous system with a single, portable code-base Offloading overhead primarily dependent on size of data transfered to/from device memory OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 38 11
39 Insights Fully functional implementation of OpenMP offloading to FPGAs Custom compilation flow mapping target region to hardware kernel without additional user input Extension of runtime library based on ThreadPoolComposer API Program a FPGA-based heterogeneous system with a single, portable code-base Offloading overhead primarily dependent on size of data transfered to/from device memory Pipelining results in 2x speedup over non-pipelined kernel OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 39 11
40 Insights Single PE execution slower than quad-core X86 (6.7x/3.4x) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 40 12
41 Insights Single PE execution slower than quad-core X86 (6.7x/3.4x) Kernels very area-efficient (1-5% logic utilization) OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 412
42 Insights Single PE execution slower than quad-core X86 (6.7x/3.4x) Kernels very area-efficient (1-5% logic utilization) Future work: Make use of coarse-grain parallelism (e.g., teams distribute) Distribute computation across multiple PEs OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 42 12
43 Stop by at our poster Questions? Get in touch: Download open-source release of ThreadPoolComposer: OpenMP FPGA Device Offloading L. Sommer, J. Korinth, A. Koch TU Darmstadt 43 13
OpenMP Device Offloading to FPGA Accelerators
OpenMP Device Offloading to FPGA Accelerators Lukas Sommer Embedded Systems and Applications Group sommer@esa.tu-darmstadt.de Jens Korinth Computer Systems Group korinth@rs.tu-darmstadt.de Andreas Koch
More informationSynthesis of Interleaved Multithreaded Accelerators from OpenMP Loops
Synthesis of Interleaved Multithreaded Accelerators from OpenMP Loops Lukas Sommer, Julian Oppermann, Jaco Hofmann, and Andreas Koch Embedded Systems and Applications Group (ESA) Technische Universität
More informationOmpCloud: Bridging the Gap between OpenMP and Cloud Computing
OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More informationOptimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices
Optimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices Mohammad Hosseinabady and Jose Luis Nunez-Yanez Department of Electrical and Electronic Engineering University of Bristol, UK. Email: {m.hosseinabady,
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationPorting Performance across GPUs and FPGAs
Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE
More informationSDAccel Development Environment User Guide
SDAccel Development Environment User Guide Features and Development Flows Revision History The following table shows the revision history for this document. Date Version Revision 05/13/2016 2016.1 Added
More informationA Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function
A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao
More informationA Uniform Programming Model for Petascale Computing
A Uniform Programming Model for Petascale Computing Barbara Chapman University of Houston WPSE 2009, Tsukuba March 25, 2009 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools Agenda
More informationOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint
More informationExploring OpenCL Memory Throughput on the Zynq
Exploring OpenCL Memory Throughput on the Zynq Technical Report no. 2016:04, ISSN 1652-926X Chalmers University of Technology Bo Joel Svensson bo.joel.svensson@gmail.com Abstract The Zynq platform combines
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationPortability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17
Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17 11/27/2017 Background Many developers choose OpenMP in hopes of having a single source code that runs effectively anywhere (performance
More informationOPENMP GPU OFFLOAD IN FLANG AND LLVM. Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz
OPENMP GPU OFFLOAD IN FLANG AND LLVM Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz MOTIVATION What does HPC programmer need today? Performance à GPUs, multi-cores, other
More informationFPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School
More informationOpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa
OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationThe Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer
The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationBurrows-Wheeler Short Read Aligner on AWS EC2 F1 Instances
University of Virginia High-Performance Low-Power Lab Prof. Dr. Mircea Stan Burrows-Wheeler Short Read Aligner on AWS EC2 F1 Instances Smith-Waterman Extension on FPGA(s) Sergiu Mosanu, Kevin Skadron and
More informationS Comparing OpenACC 2.5 and OpenMP 4.5
April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical
More informationISim Hardware Co-Simulation Tutorial: Accelerating Floating Point Fast Fourier Transform Simulation
ISim Hardware Co-Simulation Tutorial: Accelerating Floating Point Fast Fourier Transform Simulation UG817 (v 13.2) July 28, 2011 Xilinx is disclosing this user guide, manual, release note, and/or specification
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationElaborazione dati real-time su architetture embedded many-core e FPGA
Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationAn Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware
An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationThe Role of Standards in Heterogeneous Programming
The Role of Standards in Heterogeneous Programming Multi-core Challenge Bristol UWE 45 York Place, Edinburgh EH1 3HP June 12th, 2013 Codeplay Software Ltd. Incorporated in 1999 Based in Edinburgh, Scotland
More informationFPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School
More informationA case study of performance portability with OpenMP 4.5
A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationIntel HLS Compiler: Fast Design, Coding, and Hardware
white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager
More informationAccelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis
Accelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis Clay Hughes & Tao Li Department of Electrical and Computer Engineering University of Florida
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationIs OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels
National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran
More informationVivado HLx Design Entry. June 2016
Vivado HLx Design Entry June 2016 Agenda What is the HLx Design Methodology? New & Early Access features for Connectivity Platforms Creating Differentiated Logic 2 What is the HLx Design Methodology? Page
More informationLLVM for the future of Supercomputing
LLVM for the future of Supercomputing Hal Finkel hfinkel@anl.gov 2017-03-27 2017 European LLVM Developers' Meeting What is Supercomputing? Computing for large, tightly-coupled problems. Lots of computational
More informationSerial and Parallel Sobel Filtering for multimedia applications
Serial and Parallel Sobel Filtering for multimedia applications Gunay Abdullayeva Institute of Computer Science University of Tartu Email: gunay@ut.ee Abstract GSteamer contains various plugins to apply
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationDatabase Acceleration Solution Using FPGAs and Integrated Flash Storage
Database Acceleration Solution Using FPGAs and Integrated Flash Storage HK Verma, Xilinx Inc. August 2017 1 FPGA Analytics in Flash Storage System In-memory or Flash storage based DB reduce disk access
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationFahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County
Accelerating a climate physics model with OpenCL Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou University of Maryland Baltimore County Introduction The demand to increase forecast predictability
More informationOpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware
OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More informationCOMP Parallel Computing. Programming Accelerators using Directives
COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationMIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011
MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to
More informationSDA: Software-Defined Accelerator for general-purpose big data analysis system
SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search
More informationBig Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures
Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid
More informationEXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System
EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick
More informationLIQUID METAL Taming Heterogeneity
LIQUID METAL Taming Heterogeneity Stephen Fink IBM Research! IBM Research Liquid Metal Team (IBM T. J. Watson Research Center) Josh Auerbach Perry Cheng 2 David Bacon Stephen Fink Ioana Baldini Rodric
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationFPGAs as Components in Heterogeneous HPC Systems: Raising the Abstraction Level of Heterogeneous Programming
FPGAs as Components in Heterogeneous HPC Systems: Raising the Abstraction Level of Heterogeneous Programming Wim Vanderbauwhede School of Computing Science University of Glasgow A trip down memory lane
More informationParallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008
Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared
More informationA Parallelizing Compiler for Multicore Systems
A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationNEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES
NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES Design: Part 1 High Level Synthesis (Xilinx Vivado HLS) Part 2 SDSoC (Xilinx, HLS + ARM) Part 3 OpenCL (Altera OpenCL SDK) Verification:
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationOpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel
OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel Clang * : An Excellent C++ Compiler LLVM * : Collection of modular and reusable compiler and toolchain technologies Created by Chris Lattner
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationTransprecision Computing
Transprecision Computing Dionysios Speaker Diamantopoulos name, Title Company/Organization Name IBM Research - Zurich Join the Conversation #OpenPOWERSummit A look into the next 15 years -8x Source: The
More informationTed N. Booth. DesignLinx Hardware Solutions
Ted N. Booth DesignLinx Hardware Solutions September 2015 Using Vivado HLS for Video Algorithm Implementation for Demonstration and Validation Agenda Project Description HLS Lessons Learned Summary Project
More informationDid I Just Do That on a Bunch of FPGAs?
Did I Just Do That on a Bunch of FPGAs? Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto About the Talk Title It s the measure
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationPOWER-AWARE SOFTWARE ON ARM. Paul Fox
POWER-AWARE SOFTWARE ON ARM Paul Fox OUTLINE MOTIVATION LINUX POWER MANAGEMENT INTERFACES A UNIFIED POWER MANAGEMENT SYSTEM EXPERIMENTAL RESULTS AND FUTURE WORK 2 MOTIVATION MOTIVATION» ARM SoCs designed
More informationExercise: OpenMP Programming
Exercise: OpenMP Programming Multicore programming with OpenMP 19.04.2016 A. Marongiu - amarongiu@iis.ee.ethz.ch D. Palossi dpalossi@iis.ee.ethz.ch ETH zürich Odroid Board Board Specs Exynos5 Octa Cortex
More informationFiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers
FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers Rene Griessl, Meysam Peykanu, Lennart Tigges, Jens Hagemeyer, Mario Porrmann Center of Excellence Cognitive Interaction Technology
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationMultigrain Parallelism: Bridging Coarse- Grain Parallel Languages and Fine-Grain Event-Driven Multithreading
Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory - CAPSL Multigrain Parallelism: Bridging Coarse- Grain Parallel Languages and Fine-Grain Event-Driven
More informationScientific Programming in C XIV. Parallel programming
Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence
More informationIWOMP Dresden, Germany
IWOMP 2009 Dresden, Germany Providing Observability for OpenMP 3.0 Applications Yuan Lin, Oleg Mazurov Overview Objective The Data Model OpenMP Runtime API for Profiling Collecting Data Examples Overhead
More informationExploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API
EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,
More informationVivado Design Suite User Guide
Vivado Design Suite User Guide Design Flows Overview Notice of Disclaimer The information disclosed to you hereunder (the Materials ) is provided solely for the selection and use of Xilinx products. To
More informationPerformance Issues in Parallelization. Saman Amarasinghe Fall 2010
Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationCOMP528: Multi-core and Multi-Processor Computing
COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528 2X So far Why and
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org HPC Training Spring 2014 Louisiana State University Baton Rouge March 26, 2014 Introduction to OpenACC
More informationSynergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC
Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC Guanwen Zhong, Akshat Dubey, Tan Cheng, and Tulika Mitra arxiv:1804.00706v1 [cs.dc] 28 Mar 2018 School of Computing, National
More informationSODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou
SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou University of California, Los Angeles 1 What is stencil computation? 2 What is Stencil Computation? A sliding
More informationTutorial on Software-Hardware Codesign with CORDIC
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Tutorial on Software-Hardware Codesign with CORDIC 1 Introduction So far in ECE5775
More informationUnified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association
Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some
More informationOpenMP 4.5 Validation and
OpenMP 4.5 Validation and Verification Suite Sunita Chandrasekaran (schandra@udel.edu) Jose Diaz (josem@udel.edu), ORNL: Oscar Hernandez (oscar@ornl.gov), Swaroop Pophale (pophaless@ornl.gov) or David
More informationHPVM: Heterogeneous Parallel Virtual Machine
HPVM: Heterogeneous Parallel Virtual Machine Maria Kotsifakou Department of Computer Science University of Illinois at Urbana-Champaign kotsifa2@illinois.edu Prakalp Srivastava Department of Computer Science
More informationPCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction
PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction Mihail Popov, Chadi kel, Florent Conti, William Jalby, Pablo de Oliveira Castro UVSQ - PRiSM - ECR Mai 28, 2015 Introduction
More informationOpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu
OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs
More informationD-TEC DSL Technology for Exascale Computing
D-TEC DSL Technology for Exascale Computing Progress Report: March 2013 DOE Office of Science Program: Office of Advanced Scientific Computing Research ASCR Program Manager: Dr. Sonia Sachs 1 Introduction
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationPiecewise Holistic Autotuning of Compiler and Runtime Parameters
Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R
More information