Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing
Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng
Department of Micro-/Nano-electronics, Tsinghua University

Outline
- Motivation
- HPEC Implementation and Evaluation
- Kernel Benchmarks
- Synthetic Aperture Radar
- Performance Comparison
- Conclusion

HPEC: High Performance Embedded Computing
Future IT infrastructure demands ever higher computing power:
- High-performance radar: 800 GFLOPS (giga floating-point operations per second)
- 4G wireless base station: 1 Gbit/s data rate per subscriber, with up to 200 subscribers in a service area
- CMU driverless car: 270 GFLOPS

Implication
An increasing number of high-performance embedded applications will be implemented on multi-core devices:
- Intel: cluster-based Internet routers
- IBM: signal processing and radar applications on the Cell processor
- Huawei: multi-core base stations
This work systematically evaluates the potential of the GPU in terms of:
- Performance
- Scalability

HPEC Challenge Benchmark
- Developed by MIT Lincoln Laboratory*
- Quantitatively evaluates HPEC systems
- Kernel benchmarks extracted from a broad range of signal and image processing applications

* The HPEC Challenge Benchmark Suite, R. Haney, T. Meuse, J. Kepner, HPEC 2006

Kernel Benchmarks

Signal Processing:
- TDFIR: time-domain finite impulse response filtering
- FDFIR: frequency-domain finite impulse response filtering
- QR: QR factorization, prevalent in target recognition algorithms
- SVD: singular value decomposition, produces a basis for the matrix as well as its rank for reducing interference
- CFAR: constant false-alarm rate detection, finds targets in an environment with varying background noise

Communication:
- CT: corner turn (matrix transpose) to place radar data into contiguous rows for efficient FFT

Information Processing:
- PM: pattern matching, identifies stored tracks that match a target
- GA: graph optimization via genetic algorithm, removing uncorrelated data relations
- DB: database operations to store and query target tracks

Benchmark Properties

| Benchmark | Data Set | Workload (MFLOP)* | Task-Level Parallelism | Data Structure | Data Size | Data Correlation | Memory Access |
|---|---|---|---|---|---|---|---|
| TDFIR | Set 1 | 268.4 | 64 | Vector | 4096 | Low | Low |
| TDFIR | Set 2 | 1.97 | 20 | Vector | 1024 | Low | Low |
| FDFIR | Set 1 | 34 | 64 | Vector | 4096 | Low | Low |
| FDFIR | Set 2 | 2.21 | 20 | Vector | 1024 | Low | Low |
| CT | Set 1 | 2 | 1 | Matrix | 50x5000 | Very Low | Very High |
| CT | Set 2 | 30 | 1 | Matrix | 750x5000 | Very Low | Very High |
| PM | Set 1 | 1.21 | 72 | Vector | 64 | Low | Low |
| PM | Set 2 | 13.59 | 256 | Vector | 128 | Low | Low |
| CFAR | Set 1 | 0.17 | 384 | Vector | 64 | Medium | Low |
| CFAR | Set 2 | 150.5 | 6144 | Vector | 3500 | Medium | Low |
| CFAR | Set 3 | 41.1 | 3072 | Vector | 1909 | Medium | Low |
| CFAR | Set 4 | 17.7 | 480 | Vector | 9900 | Medium | Low |
| GA | Set 1 | 0.011 | 50 | Vector | 8 | Medium | High |
| GA | Set 2 | 0.51 | 200 | Vector | 96 | Medium | High |
| GA | Set 3 | 0.015 | 100 | Vector | 5 | Medium | High |
| GA | Set 4 | 0.11 | 400 | Vector | 10 | Medium | High |
| QR | Set 1 | 397 | 1 | Matrix | 500x100 | High | Medium |
| QR | Set 2 | 30.5 | 1 | Matrix | 180x60 | High | Medium |
| QR | Set 3 | 45 | 1 | Matrix | 150x150 | High | Medium |
| SVD | Set 1 | 0.24 | 1 | Matrix | 500x100 | High | Medium |
| SVD | Set 2 | 0.88 | 1 | Matrix | 180x60 | High | Medium |
| DB | Set 1 | 440 | 1 | Tree | 440 | High | Very High |
| DB | Set 2 | 700 | 1 | Tree | 700 | High | Very High |

* The workloads of CT and DB are measured in MB and transactions, respectively.

Implementation on GPU (1)
- Plenty of data-level parallelism
- Exploits the raw computing power of the GPU
- Loops of multiply-and-accumulate (MAC) operations
- Kernels: TDFIR, FDFIR, CFAR (a minimal MAC sketch follows below)
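A minimal CUDA sketch of this MAC structure (one independent filter instance per block, one output sample per thread) is shown below; the kernel name, real-valued data, and launch configuration are illustrative assumptions, not the exact implementation used in the study.

```cuda
// Time-domain FIR sketch: each block filters one independent input vector,
// each thread computes one output sample as a multiply-accumulate (MAC) loop.
__global__ void tdfir(const float* in, const float* coeff,
                      float* out, int len, int taps)
{
    int vec = blockIdx.x;                                 // filter instance
    int i   = blockIdx.y * blockDim.x + threadIdx.x;      // output sample
    if (i < len) {
        float acc = 0.0f;
        for (int k = 0; k < taps; ++k)                    // MAC loop over taps
            if (i - k >= 0)
                acc += coeff[k] * in[vec * len + (i - k)];
        out[vec * len + i] = acc;
    }
}
```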

Implementation on GPU (2)
- Plenty of task-level parallelism
- Synchronization between blocks
- Kernels: PM, GA (a one-task-per-block sketch follows below)
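One common way to exploit this task-level parallelism is to assign one task per thread block, so that the only barriers needed are block-level ones; the sketch below (hypothetical names, squared-difference metric) illustrates the pattern for pattern matching and assumes blockDim.x is a power of two.

```cuda
// One candidate pattern per block; threads cooperatively accumulate a
// distance metric and synchronize only at block level.
__global__ void pattern_match(const float* target, const float* patterns,
                              float* score, int len)
{
    extern __shared__ float partial[];                 // one slot per thread
    const float* p = patterns + blockIdx.x * len;

    float sum = 0.0f;
    for (int i = threadIdx.x; i < len; i += blockDim.x) {
        float d = target[i] - p[i];
        sum += d * d;                                  // squared difference
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // block-level reduction
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        score[blockIdx.x] = partial[0];
}
```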

Implementation on GPU (3)
- Memory-access-dominated operations
- Global memory access coalescing
- Shared memory for local operations
- Kernels: CT, DB (a coalesced corner-turn sketch follows below)
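For the corner turn, the standard technique is to stage a tile of the matrix in shared memory so that both the global-memory read and the global-memory write are coalesced; the sketch below assumes a 32x32 tile and a (32, 32) thread block, and is illustrative rather than the exact kernel used here.

```cuda
#define TILE 32

// Coalesced corner turn (matrix transpose): read a tile row-wise, write it
// back column-wise via shared memory; the +1 padding avoids bank conflicts.
__global__ void corner_turn(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;           // input column
    int y = blockIdx.y * TILE + threadIdx.y;           // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;               // output column
    y = blockIdx.x * TILE + threadIdx.y;               // output row
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```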

Implementation on GPU (4)
- Advanced linear algebra operations
- Parallelism is hard to extract
- Pipelining the row updates of the matrix (a sketch follows below)
- Kernels: QR, SVD
[Figure: (a) thread assignment, (b) step 1, (c) steps 2 and 3]
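As one possible illustration of a pipelined row update (not necessarily the exact scheme used in this work), a Givens-style rotation touches only two rows of the matrix, so each rotation can be applied across all columns in parallel while the next rotation is being prepared; the kernel name and row-major layout below are assumptions.

```cuda
// Apply one precomputed Givens rotation (c, s) to rows r1 and r2 of an
// n-column, row-major matrix A. Each thread updates one column, so a whole
// row update is a single parallel step that can be pipelined with the next.
__global__ void givens_row_update(float* A, int n, int r1, int r2,
                                  float c, float s)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {
        float a = A[r1 * n + j];
        float b = A[r2 * n + j];
        A[r1 * n + j] =  c * a + s * b;
        A[r2 * n + j] = -s * a + c * b;
    }
}
```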

Experiment Environment
- CPU: Intel Core 2 Duo, 2.66 GHz, 4 GB memory
- GPU: NVIDIA Tesla C2050, 448 cores, 1.15 GHz, 3 GB memory
- DSP: ADSP-TS101S TigerSHARC T2-PCI board, 8 DSP processors, 600 MHz, 24 Mbit on-chip memory per DSP

Performance Comparison

| Kernel | Data Set | DSP Throughput (GFLOPS)* | CPU Throughput (GFLOPS)* | GPU Throughput (GFLOPS)* | Speedup (GPU vs. DSP / GPU vs. CPU) |
|---|---|---|---|---|---|
| TDFIR | Set 1 | 6.865 | 3.382 | 97.506 | 14.2 / 28.8 |
| TDFIR | Set 2 | 0.84 | 3.326 | 23.130 | 27.5 / 6.9 |
| FDFIR | Set 1 | 3.144 | 0.541 | 61.681 | 19.6 / 114.1 |
| FDFIR | Set 2 | 0.588 | 0.542 | 11.955 | 20.3 / 22.1 |
| CT | Set 1 | - | 1.194 | 17.177 | 14.3 |
| CT | Set 2 | - | 0.501 | 35.545 | 70.9 |
| PM | Set 1 | - | 0.871 | 7.761 | 8.9 |
| PM | Set 2 | - | 0.281 | 21.241 | 75.6 |
| CFAR | Set 1 | 0.488 | 1.154 | 2.234 | 4.5 / 1.9 |
| CFAR | Set 2 | 2.568 | 1.314 | 17.319 | 6.7 / 13.1 |
| CFAR | Set 3 | 2.408 | 1.313 | 13.962 | 5.8 / 10.6 |
| CFAR | Set 4 | 2.088 | 1.261 | 8.301 | 3.9 / 6.6 |
| GA | Set 1 | - | 0.562 | 1.177 | 2.1 |
| GA | Set 2 | - | 0.683 | 8.571 | 12.5 |
| GA | Set 3 | - | 0.441 | 0.589 | 1.4 |
| GA | Set 4 | - | 0.373 | 2.249 | 6.0 |
| QR | Set 1 | 1.552 | 1.704 | 54.309 | 34.9 / 31.8 |
| QR | Set 2 | 3.056 | 0.901 | 5.679 | 1.8 / 6.3 |
| QR | Set 3 | 2.408 | 0.904 | 6.686 | 2.7 / 7.4 |
| SVD | Set 1 | 2.576 | 0.747 | 4.175 | 1.6 / 5.6 |
| SVD | Set 2 | 0.6 | 0.791 | 2.684 | 4.5 / 3.4 |
| DB | Set 1 | - | 112.3 | 126.8 | 1.13 |
| DB | Set 2 | - | 5.794 | 8.459 | 1.46 |

* The throughputs of CT and DB are measured in MB/s and transactions/s, respectively. For kernels without a DSP result, the single speedup value is GPU vs. CPU.

Power Efficiency Comparison
- CPU: 65 W, GPU: 238 W, DSP: 10 W
- The GPU suffers from low power efficiency
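As a rough illustration using the numbers reported above, on TDFIR Set 1 the GPU delivers about 97.5 GFLOPS / 238 W ≈ 0.41 GFLOPS/W, the DSP about 6.9 GFLOPS / 10 W ≈ 0.69 GFLOPS/W, and the CPU about 3.4 GFLOPS / 65 W ≈ 0.05 GFLOPS/W, so the GPU's large raw speedup does not carry over into a power-efficiency advantage over the DSP.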

Synthetic Aperture Radar Benchmark
Simulating a sensor processing chain.

Workload (MFLOP):

| Data Set | Image Size | FFT/IFFT | Match Filtering | Interpolation | Miscellaneous | Total |
|---|---|---|---|---|---|---|
| Set 1 | 382x266 | 28.61 | 6.42 | 56.88 | 1.23 | 93.14 |
| Set 2 | 762x512 | 113.38 | 22.06 | 195.34 | 4.43 | 335.21 |
| Set 3 | 1144x756 | 259.92 | 47.08 | 416.96 | 9.62 | 733.58 |

Performance Result

| Data Set | Kernel | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup |
|---|---|---|---|---|
| Set 1 | FFT/IFFT | 0.463 | 5.259 | 11.3 |
| Set 1 | Filtering | 0.538 | 17.165 | 31.8 |
| Set 1 | Interpolation | 0.256 | 19.274 | 75.1 |
| Set 1 | Overall | 0.312 | 8.316 | 26.6 |
| Set 2 | FFT/IFFT | 0.581 | 9.252 | 15.9 |
| Set 2 | Filtering | 0.545 | 25.241 | 46.3 |
| Set 2 | Interpolation | 0.252 | 17.332 | 68.8 |
| Set 2 | Overall | 0.327 | 9.507 | 29.1 |
| Set 3 | FFT/IFFT | 0.832 | 15.155 | 18.2 |
| Set 3 | Filtering | 0.523 | 26.856 | 51.3 |
| Set 3 | Interpolation | 0.248 | 18.569 | 74.7 |
| Set 3 | Overall | 0.346 | 11.403 | 32.8 |

Overview of Optimization Techniques
- Maximizing the use of on-chip resources
  - Shared memory
  - Registers
- Reducing memory access time
  - Coalesced global memory accesses
  - Overlapping transfers with computation (see the sketch below)
- Reducing divergence
  - Warp-level parallelism
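Overlapping transfers with computation is typically implemented with pinned host memory and CUDA streams; the sketch below shows the general pattern (the process kernel, chunking scheme, and buffer names are placeholders, not the actual HPEC kernels).

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real HPEC computation.
__global__ void process(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Split the work into chunks and issue copy-in / kernel / copy-out on two
// alternating streams so PCIe transfers overlap with computation.
// h_in and h_out must be pinned (cudaHostAlloc) for the copies to be async.
void run_chunked(const float* h_in, float* h_out, float* d_in, float* d_out,
                 int numChunks, int chunkElems)
{
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t s = stream[c % 2];
        size_t off   = (size_t)c * chunkElems;
        size_t bytes = (size_t)chunkElems * sizeof(float);
        cudaMemcpyAsync(d_in + off, h_in + off, bytes,
                        cudaMemcpyHostToDevice, s);
        process<<<(chunkElems + 255) / 256, 256, 0, s>>>(d_in + off,
                                                         d_out + off,
                                                         chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, bytes,
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(stream[i]);
}
```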

Architecture Implications
- SIMD width: well suited to large vector computations
  - Dynamically configurable SIMD width according to the application
- Shared memory is superior to cache for embedded applications
  - Data prefetch is preferred
- Special function units for specific applications
  - A dedicated, efficient shuffle network for FFT and similar kernels
- Power efficiency is currently quite low
  - Reorganizing memory access patterns
  - New interconnect technologies such as 3D stacking

Conclusion
- Efficient implementations of the HPEC benchmarks on NVIDIA's Fermi architecture
- Performance comparison with the CPU
  - Kernels: 10x speedup
  - SAR: 30x speedup
- A detailed analysis provides key insights into
  - Optimizing data-parallel algorithms
  - Bottlenecks of the GPU architecture for HPEC
- Publications:
  - Design Automation and Test in Europe (DATE), March 2011
  - Journal of Parallel and Distributed Computing (submitted, under review)

Thank You!