Hardware Acceleration of Pulsar Search on FPGAs using OpenCL
1 Hardware Acceleration of Pulsar Search on FPGAs using OpenCL. Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni). Parallel and Reconfigurable Computing, Department of Electrical and Computer Engineering, University of Auckland. Computing for SKA, 2017
2 Strong-field Test of Gravity using Pulsars. Image credits: NASA; NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory)
3 Outline: Overview and Task
5 Pulsar and Pulsar Search. Observed radiation is a pulse. Binary pulsar (Doppler effect). Acceleration search: 1) time-domain, 2) frequency-domain.
6 Pulsar and Pulsar Search: Frequency-domain. The matched-filtering technique is used in the Fourier domain to recover the signal into a single bin:
$$A_{r_0} \simeq \sum_{k=[r_0]-m/2}^{[r_0]+m/2} A_k \, A^{*}_{r_0-k},$$
where the frequency $r_0$ is unknown. The summation is computed at a range of frequencies $r$.
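A minimal sketch of the matched-filter sum on this slide, in plain Python. It assumes the template enters conjugated (as is usual for matched filtering) and is stored at offsets -m/2 to +m/2; the function name `matched_filter` and the template values in the usage note are hypothetical and not taken from the talk.

```python
def matched_filter(spectrum, template):
    """Return sum_k A_k * conj(T_{r-k}) for every integer bin r.

    Assumption: `template` has odd length m+1 and holds the template
    response at offsets -m/2 .. +m/2 around the candidate bin.
    """
    half = len(template) // 2
    n = len(spectrum)
    out = []
    for r in range(n):
        acc = 0j
        # Only bins k within m/2 of r contribute to the sum.
        for k in range(max(0, r - half), min(n, r + half + 1)):
            offset = r - k  # lies in -half .. +half
            acc += spectrum[k] * template[offset + half].conjugate()
        out.append(acc)
    return out


def peak_bin(values):
    """Index of the bin with the largest magnitude."""
    return max(range(len(values)), key=lambda i: abs(values[i]))
```

For example, if a made-up five-point template is smeared around bin 10 of an otherwise empty spectrum, `peak_bin(matched_filter(spectrum, template))` returns 10 and the value at that bin equals the total power of the template.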
7 Block Overview of Pulsar Search Engine. [Block diagram; legend: common, single pulse, time-domain acceleration, frequency-domain acceleration. Data and buffers: Beamformed Data (BFD), Filterbank Data Chunks (FDC), Dedispersion Buffers (DB), Flagged DB (FDB), Dedispersed Data Buffer (DDB), Periodicity Search Buffer (PSB), filterbank data for selected SP candidates, filterbank data for candidate folding. Modules: Data Receptor (RCPT), Dedispersion Buffer Creator (DDBC), RFI Mitigation (RFIM), Dedispersion Transform (DDTR), Periodicity Search Buffer Creator (PSBC), Candidate Data Output Streamer (CDOS), Single Pulse Detector (SPCT), Complex Fourier Transform (CXFT), Candidate Folding and Optimisation (FLDO), Full Filterbank Buffer Creator (FFBC), Single Pulse Optimiser (SPOPT), Single Pulse Sifter (SPSIFT), Birdie Zapping (BRDZ), Candidate Sifting (SIFT), Time Domain Candidate Optimisation (TDAO), Harmonic Summing (HRMS), Fourier Transform and Power Spectrum (PWFT), Time Domain Resampler Transform (TDRT), Inverse Complex Fourier Transform (icxft), Dereddening Spectrum (DRED), Fourier Domain Candidate Optimisation (FDAO), Fourier Domain Acceleration Search (FDAS). Interfaces: to SDP, from SDP.]
8 Fourier-domain Acceleration Search (FDAS). The FDAS module searches for (binary) pulsars with constant frequency derivatives in the frequency domain. Over 2,000 beams are formed at 4,096 channels/beam; the signals of each beam (Beam_i) are de-dispersed for 6,000 DMs (DM_1 ... DM_6000). [Diagram labels: Beam_2 ... Beam_i ... Beam_N; PSS Engine_i; single pulse search modules; time-domain acceleration; pre-processing (RFIM, DDTR, PSBC, CXFT, BRDZ, DRED); FDAS module, consisting of the FT convolution module (FIR_1 ... FIR_k ... FIR_85) and the harmonic-sum module; post-processing.] 85 FIR filters; the maximum length is 421 taps.
9 Specification of Task. Parameters (symbol: description, value): B: number of beams; DM: number of de-dispersion measure (DM) trials, 6000; T_obs: observation period, 540 s; t_limit: time for executing one sample group, 88 ms; N: number of complex samples per group, 2^22; M: number of templates/filters, 85; K: average template/filter length, > 200.
11 FT Convolution: complex floating-point operations; multiple long FIR filters; large input size; strict time limit; number of acceleration devices (CapEx); energy consumption (OpEx).
12 Basic Element. Time-domain FIR filter (TDFIR): $y_m[i] = \sum_{k=0}^{K-1} x_m[i-k]\, h_m[k]$, for $i = 0, 1, \ldots, N-1$. Frequency-domain FIR filter (FDFIR): $\mathcal{F}\{f * h\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{h\}$.
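The two basic elements can be compared with a short sketch in plain Python. It is an illustration under stated assumptions, not the OpenCL kernels from the talk: the names `tdfir`, `fdfir` and `fft` are hypothetical, samples before the start of the input are taken as zero, and the FFT is a simple recursive radix-2 version.

```python
import cmath


def tdfir(x, h):
    """Time-domain FIR: y[i] = sum_{k=0}^{K-1} x[i-k] * h[k], i = 0..N-1.

    Assumption: x[j] = 0 for j < 0.
    """
    return [sum(h[k] * x[i - k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fdfir(x, h):
    """Frequency-domain FIR: F{x * h} = F{x} . F{h}, then invert.

    Both arrays are zero-padded to a power of two >= N + K - 1 so that the
    circular convolution equals the linear one.
    """
    n = 1
    while n < len(x) + len(h) - 1:
        n *= 2
    xs = fft(list(x) + [0j] * (n - len(x)))
    hs = fft(list(h) + [0j] * (n - len(h)))
    y = fft([a * b for a, b in zip(xs, hs)], inverse=True)
    return [v / n for v in y[:len(x)]]
```

Both functions return the first N output samples and agree up to floating-point rounding. The time-domain form costs about N·K complex multiplications, whereas the frequency-domain form costs on the order of n·log n for the padded length n, which is why the choice matters for long filters.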
13 Hardware Limitation. Naïve time domain: DSP blocks; single-precision floating-point (SPF) multiplications: $(A + iB)(C + iD) = (AC - BD) + i(AD + BC)$. Naïve frequency domain: off-chip (global) memory: off-chip memory bandwidth; RAM blocks: on-chip (local) memory size; 4 million elements = 32 MBytes.
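The figures on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are hypothetical, "4 million elements" is read as 2^22 complex samples (the N of the task specification), and the 421-tap filter is the maximum length quoted earlier in the talk.

```python
def complex_mul(a, b, c, d):
    """(a + ib)(c + id) with four real multiplications and two additions."""
    return (a * c - b * d, a * d + b * c)


SAMPLES = 2 ** 22                 # assumed meaning of "4 million elements"
BYTES_PER_COMPLEX_SPF = 2 * 4     # real + imaginary part, 4 bytes each

# Memory needed to hold one input array of complex SPF samples, in MiB.
footprint_mib = SAMPLES * BYTES_PER_COMPLEX_SPF / 2 ** 20

# Multiplications for one naive time-domain pass of a 421-tap filter.
TAPS = 421
complex_mults = SAMPLES * TAPS
real_mults = 4 * complex_mults    # four real multiplies per complex multiply
```

The footprint works out to 32 MiB, matching the slide, which is far more than the on-chip RAM can hold, while each complex multiplication in the time-domain filter consumes four SPF multiplications on the DSP blocks.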
14 Decomposition Algorithms. Overlap-add algorithm: split the coefficient array -> OLA-TD. Overlap-save algorithm: split the input array -> OLS-FD. [Figure (a) OLA: the coefficients are split into N groups C_1, C_2, ..., C_N; the zero-padded input data (zero padding of length Ncoef - 1) is convolved with each subset coefficient group i, giving Output data_1 ... Output data_i ... Output data_N, which are combined into the output data. Figure (b) OLS: the input is split into N small groups ID_1 ... ID_N; each group ID_i is convolved with the FIR filter to give PD_i; the Ncoef - 1 overlapping elements are discarded and PD_1 ... PD_N form the output data.]
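A minimal overlap-save sketch in plain Python, to illustrate splitting the input into blocks and discarding the first Ncoef - 1 outputs of each block. It is an illustration under assumptions, not the OLS-FD kernel from the talk: the names `overlap_save`, `fft` and `fft_size` are hypothetical, `fft_size` must be a power of two larger than the filter length, and input before the first sample is taken as zero.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def overlap_save(x, h, fft_size):
    """FIR-filter x with h block by block using FFTs of length fft_size."""
    k = len(h)
    step = fft_size - (k - 1)            # valid outputs produced per block
    if step <= 0:
        raise ValueError("fft_size must exceed len(h) - 1")
    # The filter spectrum is computed once and reused for every block.
    h_spec = fft(list(h) + [0j] * (fft_size - k))
    padded = [0j] * (k - 1) + list(x)    # K-1 zeros of history before x[0]
    out = []
    for start in range(0, len(x), step):
        block = padded[start:start + fft_size]
        block += [0j] * (fft_size - len(block))   # pad the final block
        spec = fft(block)
        prod = [a * b for a, b in zip(spec, h_spec)]
        y = [v / fft_size for v in fft(prod, inverse=True)]
        out.extend(y[k - 1:])            # discard the K-1 aliased outputs
    return out[:len(x)]
```

Each block overlaps its predecessor by Ncoef - 1 samples, so only a block-sized working set has to sit in on-chip memory at a time. A larger `fft_size` wastes a smaller fraction of each block on the discarded overlap but needs more local memory.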
16 High-level Techniques. Maxeler MaxCompiler, using Java to develop for FPGAs (HPCC2016). Open Computing Language (OpenCL) for FPGAs (Intel FPGA cards), GPUs, and CPUs (FPT2016, best paper candidate). [Diagram: a host (Core 1 ... Core 4; memory: DDR3 and SSD) connected via PCIe to FPGA_0 ... FPGA_i. On each FPGA, kernel pipelines access block RAM through the local memory interconnect, and 2 x 2GB DDR3 global memory through the global memory interconnect and the DDR controllers & PHY.]
17 Kernel Structures OLA
18 Kernel Structures OLS. AOLS: a data fetch and multiplication kernel (NDRange), an FFT/IFFT kernel (single work-item) and a bit-reverse kernel (NDRange), connected by channels. The pipeline is launched twice (1st launch: FFT, 2nd launch: IFFT), and the two global memory banks (Bank1, Bank2) are switched between the launches; the global memory holds the input, the processed coefficients and the output. TOLS: an FFT data fetch kernel (NDRange), an FFT and multiplication kernel (single work-item), an FFT bit-reverse kernel (NDRange), an IFFT data fetch kernel (NDRange), an IFFT kernel (single work-item) and an IFFT bit-reverse kernel (NDRange), connected by channels, using global memory Bank1 and Bank2 for the input, the processed coefficients and the output. [Diagrams of both kernel structures.]
20 Platform Overview and Task Table: Details of FPGA and GPU Platforms Device (Board) Terasic DE5-Net Sapphire Nitro R7 370 Hardware Intel Stratix V 5SGXA7 AMD Radeon R7 370 Technology 28nm 28nm 622,000 LEs Compute resource 256 DSP blocks 1024 Stream Processors On-chip memory size 50Mb Global memory size 2 x 2GB DDR3 3GB GDDR5 Global memory frequency 800MHz 5, 600MHz Memory interface width 2 x 64-bit 256-bit Max clock frequency 985MHz OpenCL Max power consumption 150W
21 Latency TDFIR vs FDFIR. 4 TDFIR kernels: naïve (TD-Naïve-64S, TD-Naïve-64N) and OLA (OLA-64S, OLA-64N). 5 FDFIR kernels: naïve (FD-Naïve) and OLS, the latter as AOLS (AOLS-1024, AOLS-2048, AOLS-4096) and TOLS (TOLS-1024).
22 Latency TDFIR vs FDFIR. Latencies of a single FPGA (Intel Stratix V A7) processing the same input array with 9 different OpenCL kernels. [Plot: kernel execution latency (ms) against FIR filter length for TD-Naïve-64S, TD-Naïve-64N, OLA-64S, OLA-64N, FD-Naïve, AOLS-1024, AOLS-2048, AOLS-4096 and TOLS.]
23 Multiple FIR Filters. Even the fastest kernel cannot meet the time limit => implement multiple FIR filters in parallel. Problem: the bandwidth of the off-chip memory is the main bottleneck. Solution: do more processing! Calculate the power of the complex values (needed as input in the next stage). Problem: the number of DSP blocks limits the number of parallelisable filters. Solution: downscale the FFT engine input size from 8 points to 4 points.
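The "power of complex values" step can be sketched in a few lines of Python. The function names and the byte counts are illustrative assumptions (4-byte SPF values for both the complex parts and the resulting power); they are not measurements from the talk.

```python
def power(values):
    """|z|^2 = re^2 + im^2: one real output per complex input sample."""
    return [z.real * z.real + z.imag * z.imag for z in values]


def output_bytes(n_samples, as_power):
    """Bytes written for n_samples, assuming 4-byte SPF values."""
    per_sample = 4 if as_power else 8   # one real value vs. real + imaginary
    return n_samples * per_sample
```

Under these assumptions, writing powers instead of complex values halves the data written back to off-chip memory per filter, at the cost of two extra multiplications and one addition per sample on chip.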
26 Multiple FIR Filters. 1) Multiple FIR filters, 2) power of complex values: this becomes an optimisation problem! [Chart: DSP block usage, split into FFT engine, element-wise multiplications, power and unused DSP blocks, for the kernel configurations TOLS, AOLS, 2 x AOLS, AOLS-1024-P (8 points), 2 x AOLS-1024-P (8 points), AOLS-1024-P (4 points), 3 x AOLS-1024-P (4 points), AOLS-2048-P (4 points) and 3 x AOLS-2048-P (4 points).]
27 Multiple FIR Filters and FPGAs. [Four plots comparing 1, 2 and 3 FPGAs for the OpenCL kernels 3xAOLS-1024-P, 3xAOLS-2048-P and 3xAOLS-4096-P: latency of 84 FIR filters (ms), kernel performance (GFLOPS), power efficiency (GFLOPS/watt) and energy dissipation (Joule).]
28 Latency FPGA vs GPU. Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs processing 2 and 4 million points. [Plot: kernel execution latency (ms) against the total number of FIR filters for 3xAOLS-2048-P (4 million), 3xAOLS-2048-P (2 million), GPU FD (4 million) and GPU FD (2 million).]
29 Energy FPGA vs GPU. Energy dissipation of a single FPGA and a GPU executing the same task with different kernels. [Chart: energy dissipation (Joule) for TOLS-1024, AOLS, 3xAOLS-4096-P, 3xAOLS-2048-P, 3xAOLS-1024-P and GPU FD.]
31 Harmonic-summing. Input: filter-output-plane (FOP, SPF points, ~1.33 GBytes). Processing flow: 8 harmonic planes are generated from the FOP and the stretched planes; there is one threshold for each row of each harmonic plane (overall: 85 × 8 = 680); hundreds of candidates are recorded. Output: candidates; each candidate contains the indexes of filter, harmonic and bin, and the amplitude (up to 64 bits).
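The processing flow can be illustrated with a deliberately simplified Python sketch. It is not the authors' algorithm: it sums harmonics along the frequency axis only, using the assumed convention that harmonic k of bin i lies at bin k·i, whereas the talk refers to stretched planes of the whole FOP. All names (`harmonic_sum`, `find_candidates`, `thresholds`) are hypothetical.

```python
def harmonic_sum(row, n_harmonics):
    """Return planes where planes[h-1][i] = sum_{k=1..h} row[k*i].

    Assumption: terms whose index k*i falls past the end of the row are
    skipped. planes[0] is the original row.
    """
    n = len(row)
    acc = list(row)
    planes = [list(acc)]
    for h in range(2, n_harmonics + 1):
        for i in range(n):
            if h * i < n:
                acc[i] += row[h * i]   # only additions are required
        planes.append(list(acc))
    return planes


def find_candidates(fop, thresholds, n_harmonics):
    """Record (filter, harmonic, bin, amplitude) for every value above threshold.

    fop[f] is the power row of filter f; thresholds[f][h-1] is the threshold
    for that row of harmonic plane h (one threshold per row per plane).
    """
    candidates = []
    for f, row in enumerate(fop):
        for h, plane in enumerate(harmonic_sum(row, n_harmonics), start=1):
            for i, amp in enumerate(plane):
                if amp > thresholds[f][h - 1]:
                    candidates.append((f, h, i, amp))
    return candidates
```

With 85 filter rows and 8 harmonic planes, `thresholds` would hold the 85 × 8 = 680 values mentioned on the slide. The arithmetic is just additions and comparisons, while the reads at indices k·i are strided, which is where the memory access pattern becomes the difficulty.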
32 Harmonic-summing. Problems: the input data is too large (~1.33 GBytes); the on-chip memory is too small to hold all planes; the computation itself is cheap (SPF additions) and easy to parallelise. Challenge: off-chip memory bandwidth is the bottleneck, so the task is to optimise data use (and reuse); computation is not an issue.
33 Conclusion. FPGA-based implementation and optimisation of FT convolution (FIR filter), based on the OLA and OLS algorithms. High-level approaches to such tasks work well: a large design space was covered, and porting and sharing with partners is easy. With multiple FPGAs, the FPGA implementation has an advantage over the GPU in both performance (GFLOPS) and energy efficiency.
More informationCHAPTER 4. DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM
CHAPTER 4 IMPLEMENTATION OF DIGITAL UPCONVERTER AND DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM 4.1 Introduction FPGAs provide an ideal implementation platform for developing broadband wireless systems such
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationLow Cost FPGA Implementation of Fresnel Transform for Digital Holography
Proceedings of the 2 nd World Congress on Electrical Engineering and Computer Systems and Science (EECSS 16) Budapest, Hungary - August 16-17, 2016 Paper No. EEE 129 DOI: 10.11159/eee16.129 Low Cost FPGA
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationVector IRAM: A Microprocessor Architecture for Media Processing
IRAM: A Microprocessor Architecture for Media Processing Christoforos E. Kozyrakis kozyraki@cs.berkeley.edu CS252 Graduate Computer Architecture February 10, 2000 Outline Motivation for IRAM technology
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationChapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY
Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More information"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 9
General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information