Hardware Acceleration of Pulsar Search on FPGAs using OpenCL

1 Hardware Acceleration of Pulsar Search on FPGAs using OpenCL
Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni)
Parallel and Reconfigurable Computing, Department of Electrical and Computer Engineering, University of Auckland
Computing for SKA, 2017

2 Strong-field Test of Gravity using Pulsars
Image credits: NASA; NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory)

3 Outline
1. Overview and Task

5 Pulsar and Pulsar Search
Observed radiation is a pulse.
Binary pulsar (Doppler effect).
Acceleration search: 1) time-domain, 2) frequency-domain.

6 Pulsar and Pulsar Search: Frequency-domain
Using the matched filtering technique in the Fourier domain to recover the signal into a single bin:

$$A_{r_0} = \sum_{k=[r_0]-m/2}^{[r_0]+m/2} A_k \, A_{r_0-k},$$

where the frequency $r_0$ is unknown. The summation is computed at a range of frequencies $r$.

7 Block Overview of Pulsar Search Engine
[Block diagram. Legend: Common, Single Pulse, Time domain Acc, Freq domain Acc. Interfaces: To SDP, From SDP.]
Data items: Beamformed Data (BFD), Filterbank Data Chunks (FDC), Filterbank Data for Selected SP Candidates, Dedispersion Buffers (DB), Flagged DB (FDB), Dedispersed Data Buffer (DDB), Periodicity Search Buffer (PSB), Filterbank Data for Candidate Folding.
Modules: Data Receptor (RCPT), Dedispersion Buffer Creator (DDBC), RFI Mitigation (RFIM), Dedispersion Transform (DDTR), Periodicity Search Buffer Creator (PSBC), Candidate Data Output Streamer (CDOS), Single Pulse Detector (SPCT), Complex Fourier Transform (CXFT), Candidate Folding and Optimisation (FLDO), Full Filterbank Buffer Creator (FFBC), Single Pulse Optimiser (SPOPT), Single Pulse Sifter (SPSIFT), Birdie Zapping (BRDZ), Candidate Sifting (SIFT), Time Domain Candidate Optimisation (TDAO), Harmonic Summing (HRMS), Fourier Transform and Power Spectrum (PWFT), Time Domain Resampler Transform (TDRT), Inverse Complex Fourier Transform (icxft), Dereddening Spectrum (DRED), Fourier Domain Candidate Optimisation (FDAO), Fourier Domain Acceleration Search (FDAS).

8 Fourier-domain Acceleration Search (FDAS)
The FDAS module is applied to search for (binary) pulsars with constant frequency derivatives in the frequency domain.
Over 2,000 beams are formed at 4,096 channels/beam. Beam_i signals are de-dispersed for 6,000 DMs (DM_1, DM_2, ..., DM_j, ..., DM_6000).
[Figure: beams (Beam_2 ... Beam_i ... Beam_N) feed PSS Engine_i, which holds the single pulse search modules and time-domain acceleration or FDAS. Pipeline: pre-processing (RFIM, DDTR, PSBC, CXFT, BRDZ, DRED), FDAS module (FT convolution module with FIR_1 ... FIR_k ... FIR_85, then harmonic-sum module), post-processing.]
85 FIR filters; the maximum length is 421 taps.

9 Specification of Task

Parameter   Description                                 Value
B           # of beams
DM          # of de-dispersion measure (DM) trials      6000
T_obs       Observation period                          540 s
t_limit     Time of executing one sample group          88 ms
N           # of complex samples per group              2^22
M           # of templates/filters                      85
K           Average template/filter length              > 200
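As a rough sizing exercise that is not part of the slides, the parameters above suggest the arithmetic load if every filter were run as a direct time-domain FIR. The sketch assumes K = 200 taps (the table only states K > 200) and counts one complex multiply-accumulate as 8 real floating-point operations; both figures are assumptions, so the result is an order-of-magnitude estimate only.

```python
# Back-of-envelope load for one sample group, using the task parameters.
N = 2 ** 22         # complex samples per group
M = 85              # templates / filters
K = 200             # ASSUMED taps per filter (the table only says K > 200)
T_LIMIT_S = 0.088   # time limit per sample group (88 ms)

FLOPS_PER_CMAC = 8  # ASSUMED: 4 mults + 4 adds per complex multiply-accumulate


def direct_fir_flops(n, m, k):
    """Real floating-point operations for m direct FIR filters of k taps over n samples."""
    return n * m * k * FLOPS_PER_CMAC


total_cmacs = N * M * K
total_flops = direct_fir_flops(N, M, K)
required_gflops = total_flops / T_LIMIT_S / 1e9

print(f"complex MACs : {total_cmacs:.3e}")
print(f"total FLOPs  : {total_flops:.3e}")
print(f"needed rate  : {required_gflops:.0f} GFLOPS per sample group")
```

Under these assumptions the direct form would need several TFLOPS sustained, which is one way to see why the later slides compare time-domain and frequency-domain formulations.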

11 FT Convolution
Complex floating-point operations.
Multiple long FIR filters.
Large input size.
Strict time limit.
Number of acceleration devices (CapEx).
Energy consumption (OpEx).

12 Basic Element
Time-domain FIR Filter (TDFIR):

$$y_m[i] = \sum_{k=0}^{K-1} x_m[i-k]\, h_m[k], \quad \text{for } i = 0, 1, \ldots, N-1$$

Frequency-domain FIR Filter (FDFIR):

$$\mathcal{F}\{f * h\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{h\}$$
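The two formulations can be checked against each other on a small input. The following is a minimal plain-Python sketch, not the OpenCL kernels from the talk: `tdfir` evaluates the sum directly, and `fdfir` uses the convolution theorem with a small recursive radix-2 FFT written here for illustration. The function names and the test data are assumptions.

```python
import cmath


def tdfir(x, h):
    """Direct form: y[i] = sum_k x[i-k] * h[k] for i = 0..N-1 (x[j] = 0 for j < 0)."""
    return [sum(x[i - k] * h[k] for k in range(min(len(h), i + 1)))
            for i in range(len(x))]


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fdfir(x, h):
    """FFT form: zero-pad, multiply the spectra element-wise, inverse transform."""
    size = 1
    while size < len(x) + len(h) - 1:   # long enough to avoid circular wrap-around
        size *= 2
    xs = fft(list(x) + [0j] * (size - len(x)))
    hs = fft(list(h) + [0j] * (size - len(h)))
    y = fft([a * b for a, b in zip(xs, hs)], inverse=True)
    return [v / size for v in y[:len(x)]]   # scale the inverse, keep N outputs


x = [complex(i % 5, -(i % 3)) for i in range(16)]
h = [1 + 1j, 0.5, -0.25j]
max_err = max(abs(a - b) for a, b in zip(tdfir(x, h), fdfir(x, h)))
print("max difference:", max_err)
```

The direct form costs on the order of N·K complex multiplications per filter, while the FFT form costs on the order of N·log N, which is the usual motivation for the frequency-domain variant when K is in the hundreds.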

13 Hardware Limitation
Naïve time domain: limited by DSP blocks, which perform the single-precision floating-point (SPF) multiplications: (A + iB)(C + iD) = (AC - BD) + i(AD + BC).
Naïve frequency domain: limited by off-chip (global) memory bandwidth and by on-chip (local) memory size (RAM blocks). 4-Million elements = 32 MBytes.
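Both limits can be made concrete with a few lines of arithmetic. This is an illustrative sketch only: `complex_mul` spells out the four real multiplications and two additions behind one complex product, and the buffer size assumes each of the 2^22 elements is a complex value stored as two single-precision floats.

```python
def complex_mul(a, b, c, d):
    """(a + ib)(c + id) using 4 real multiplications and 2 real additions."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag


BYTES_PER_SPF = 4                      # single-precision float
N = 2 ** 22                            # "4-Million elements"
buffer_bytes = N * 2 * BYTES_PER_SPF   # real + imaginary part per element

product = complex_mul(1.0, 2.0, 3.0, 4.0)
print(product)                           # (-5.0, 10.0)
print(buffer_bytes // 2 ** 20, "MiB")    # 32 MiB
```

Each tap of a complex FIR filter therefore consumes four SPF multipliers per cycle if fully unrolled, and a single input group is already several times larger than the 50 Mb of on-chip memory quoted in the platform table.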

14 Decomposition Algorithms
Overlap-add algorithm: split the coefficient array -> OLA-TD.
Overlap-save algorithm: split the input array -> OLS-FD.
[Figure (a) OLA: the coefficients are split into groups C_1, C_2, ..., C_N; the zero-padded input data is convolved with coefficient subset i to give Output data_i, and Output data_1 ... Output data_n are combined into the output data. Labelled lengths: Ncoef - 1 (zeros), Ncoef/N + ..., Ncoef/N - 1.]
[Figure (b) OLS: the input is split into N small groups ID_1 ... ID_N; each ID_i is convolved with the FIR filter to give PD_i; the Ncoef - 1 elements are discarded and PD_1 ... PD_N form the output data.]
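The overlap-save idea (split the input, filter each block in the frequency domain, discard the Ncoef - 1 wrapped elements) can be sketched as follows. This is a plain-Python illustration under assumed names (`overlap_save`, `fft_size`), not the authors' OpenCL kernel; the FFT is a small recursive radix-2 routine included only to keep the example self-contained.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def overlap_save(x, h, fft_size):
    """Filter x with h block by block; each block yields fft_size - (len(h) - 1) outputs."""
    n_coef = len(h)
    step = fft_size - (n_coef - 1)            # valid outputs per block
    assert step > 0 and fft_size & (fft_size - 1) == 0
    h_spec = fft(list(h) + [0j] * (fft_size - n_coef))   # coefficients transformed once
    padded = [0j] * (n_coef - 1) + list(x)    # zero history in front of the first block
    y = []
    for start in range(0, len(x), step):
        block = padded[start:start + fft_size]
        block += [0j] * (fft_size - len(block))           # pad the last block
        spec = fft(block)
        out = fft([s * c for s, c in zip(spec, h_spec)], inverse=True)
        # The first n_coef - 1 outputs are corrupted by circular wrap-around: discard.
        y.extend(v / fft_size for v in out[n_coef - 1:])
    return y[:len(x)]


def direct(x, h):
    """Reference direct-form FIR for checking the result."""
    return [sum(x[i - k] * h[k] for k in range(min(len(h), i + 1)))
            for i in range(len(x))]


x = [complex(i % 7, i % 4) for i in range(50)]
h = [1, -0.5j, 0.25, 2j, -1]
ols_err = max(abs(a - b) for a, b in zip(overlap_save(x, h, 16), direct(x, h)))
print("max difference:", ols_err)
```

Because consecutive blocks overlap by Ncoef - 1 samples, a larger FFT size wastes a smaller fraction of each block, at the price of more on-chip storage per transform.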

16 High-level Techniques
Maxeler MaxCompiler, using Java to develop for FPGAs (HPCC2016).
Open Computing Language (OpenCL) for FPGAs (Intel FPGA cards), GPUs, and CPUs (FPT2016, best paper candidate).
[Figure: a host (Core 1 ... Core 4, memory: DDR3 and SSD) is connected over PCIe to FPGA_0 ... FPGA_i. Each FPGA has 2 x 2GB DDR3 global memory behind DDR controllers & PHY and a global memory interconnect, a set of kernel pipelines, and block RAM as local memory behind a local memory interconnect.]

17 Kernel Structures OLA
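A minimal sketch of the coefficient-splitting idea behind OLA-TD, in plain Python rather than OpenCL. It assumes, as the decomposition slide suggests, that the coefficient array is cut into groups, the input is filtered with each group separately, and the partial outputs are shifted by the group offset and added. The function names and the group length are illustrative assumptions.

```python
def fir(x, h, n_out):
    """Direct FIR producing n_out samples; x is treated as zero outside its range."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if 0 <= i - k < len(x))
            for i in range(n_out)]


def ola_split_coefficients(x, h, group_len):
    """Filter x with h by summing the outputs of short sub-filters of length group_len."""
    n_out = len(x)
    y = [0j] * n_out
    for offset in range(0, len(h), group_len):
        sub = h[offset:offset + group_len]     # coefficient group C_i
        partial = fir(x, sub, n_out)           # short filter that fits the hardware
        # Coefficient h[offset + k] delays the input by offset + k samples, so the
        # partial output is shifted by `offset` before it is accumulated.
        for i in range(n_out - offset):
            y[i + offset] += partial[i]
    return y


x = [complex(i % 6, -(i % 5)) for i in range(40)]
h = [complex(k + 1, k % 2) for k in range(10)]
ola_err = max(abs(a - b) for a, b in
              zip(ola_split_coefficients(x, h, 4), fir(x, h, len(x))))
print("max difference:", ola_err)
```

The group length would presumably be chosen so that one group's multipliers fit into the available DSP blocks; longer filters then cost more passes rather than more hardware.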

18 Kernel Structures OLS
[Figure: two OLS kernel structures whose kernels are connected by channels.]
AOLS: Data Fetch and Multiplication Kernel (NDRange), FFT/IFFT Kernel (Single), Bit-Reverse Kernel (NDRange). The structure is launched twice: the 1st launch performs the FFT and the 2nd launch the IFFT. Input, processed coefficients and output reside in global memory, and the banks are switched between launches (one side uses Bank1 on the 1st launch and Bank2 on the 2nd; the other uses Bank2 on the 1st launch and Bank1 on the 2nd).
TOLS: FFT Data Fetch Kernel (NDRange), FFT and Multiplication Kernel (Single), FFT Bit-Reverse Kernel (NDRange), IFFT Data Fetch Kernel (NDRange), IFFT Kernel (Single), IFFT Bit-Reverse Kernel (NDRange). Input, processed coefficients and output use Global Memory (Bank1) and Global Memory (Bank2).

20 Platform
Table: Details of FPGA and GPU Platforms

                          FPGA                           GPU
Device (Board)            Terasic DE5-Net                Sapphire Nitro R7 370
Hardware                  Intel Stratix V 5SGXA7         AMD Radeon R7 370
Technology                28nm                           28nm
Compute resource          622,000 LEs, 256 DSP blocks    1024 Stream Processors
On-chip memory size       50Mb
Global memory size        2 x 2GB DDR3                   3GB GDDR5
Global memory frequency   800MHz                         5,600MHz
Memory interface width    2 x 64-bit                     256-bit
Max clock frequency                                      985MHz
OpenCL
Max power consumption                                    150W
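The memory rows of the table translate into theoretical peak bandwidths with simple arithmetic. The sketch below is not from the slides and rests on two assumptions that are marked in the code: that 800MHz is the DDR3 I/O clock (so data moves at twice that rate), and that 5,600MHz is already the effective GDDR5 data rate.

```python
def peak_bandwidth_gbs(transfers_per_s, bus_bits, channels=1):
    """Theoretical peak bandwidth in GB/s for a memory interface."""
    return transfers_per_s * (bus_bits / 8) * channels / 1e9


# ASSUMPTION: 800 MHz is the DDR3 I/O clock, i.e. 1600 MT/s (double data rate).
fpga_bw = peak_bandwidth_gbs(2 * 800e6, 64, channels=2)

# ASSUMPTION: 5,600 MHz is the effective GDDR5 data rate.
gpu_bw = peak_bandwidth_gbs(5600e6, 256)

print(f"FPGA board : {fpga_bw:.1f} GB/s")
print(f"GPU        : {gpu_bw:.1f} GB/s")
print(f"ratio      : {gpu_bw / fpga_bw:.1f}x")
```

Under these assumptions the GPU has roughly seven times the off-chip bandwidth of the FPGA board, which fits the emphasis the later slides put on off-chip memory traffic for the FPGA kernels.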

21 Latency TDFIR vs FDFIR
4 TDFIR kernels: Naïve (TD-Naïve-64S, TD-Naïve-64N) and OLA (OLA-64S, OLA-64N).
5 FDFIR kernels: Naïve (FD-Naïve) and OLS, with AOLS (AOLS-1024, AOLS-2048, AOLS-4096) and TOLS (TOLS-1024).

22 Latency TDFIR vs FDFIR
Latencies of a single FPGA (Intel Stratix V A7) in processing the same input array using 9 different OpenCL kernels:
[Figure: kernel execution latency (ms) versus FIR filter length for TD-Naïve-64S, TD-Naïve-64N, OLA-64S, OLA-64N, FD-Naïve, AOLS-1024, AOLS-2048, AOLS-4096 and TOLS.]

23 Multiple FIR Filters
Even the fastest kernel cannot meet the time limit => implement multiple FIR filters in parallel.
Problem: the bandwidth of off-chip memory is the main bottleneck.
Solution: do more processing! Calculate the power of the complex values (needed as input in the next stage).
Problem: the number of DSP blocks limits the number of filters that can run in parallel.
Solution: downscale the FFT engine input size: 8 points -> 4 points.
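One way to read the "do more processing" point: if the kernel writes the power of each filtered sample instead of the complex value, every output element shrinks from two floats to one. The sketch below is illustrative only; the byte sizes assume single-precision storage and the sample values are made up.

```python
def power(z):
    """|z|^2 with 2 real multiplications and 1 addition (no square root needed)."""
    return z.real * z.real + z.imag * z.imag


BYTES_COMPLEX_SPF = 8   # two single-precision floats per complex output
BYTES_POWER_SPF = 4     # one single-precision float per power value

filtered = [3 + 4j, 1 - 1j, 0.5j]          # made-up filter outputs
powers = [power(z) for z in filtered]
saving = 1 - BYTES_POWER_SPF / BYTES_COMPLEX_SPF

print(powers)
print(f"{saving:.0%} less output traffic per element")
```

The extra multipliers spent on the power calculation compete with the FFT engine for DSP blocks, which is the trade-off the second problem/solution pair on the slide addresses.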

26 Multiple FIR Filters
1) Multiple FIR filters, 2) power of complex values: this becomes an optimisation problem!
[Figure: DSP block usage, split into FFT engine, element-wise multiplications, power, and unused DSP blocks, for the configurations TOLS, AOLS, 2 x AOLS, AOLS-1024-P 8points, 2 x AOLS-1024-P 8points, AOLS-1024-P 4points, 3 x AOLS-1024-P 4points, AOLS-2048-P 4points and 3 x AOLS-2048-P 4points.]

27 Multiple FIR Filters and FPGAs
[Figure, four panels, each comparing 1 FPGA, 2 FPGAs and 3 FPGAs for the OpenCL kernels 3xAOLS-1024-P, 3xAOLS-2048-P and 3xAOLS-4096-P: latency of 84 FIR filters (ms); kernel performance (GFLOPS); power efficiency (GFLOPS/watt); energy dissipation (Joule).]

28 Latency FPGA vs GPU
Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs in processing 2 and 4 Million points:
[Figure: kernel execution latency (ms) versus total FIR filters for 3xAOLS-2048-P (4 Million), 3xAOLS-2048-P (2 Million), GPU FD (4 Million) and GPU FD (2 Million).]

29 Energy FPGA vs GPU
Energy dissipation of a single FPGA and a GPU in executing the same task with different kernels:
[Figure: energy dissipation (Joule) for TOLS-1024, AOLS, 3xAOLS-4096-P, 3xAOLS-2048-P, 3xAOLS-1024-P and GPU FD.]

31 Harmonic-summing
Input: filter-output-plane (FOP, SPF points, ~1.33GBytes).
Processing flow: 8 harmonic planes are generated based on the FOP and the stretched planes. There is one threshold for each row of each harmonic plane (overall: 85 x 8 = 680). Hundreds of candidates are recorded.
Output: candidates. Each candidate contains the indexes of filter, harmonic and bin, and the amplitude (up to 64-bit).
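The processing flow can be sketched for a single FOP row. The slides do not give the exact stretching and indexing scheme, so this example assumes a common textbook form of harmonic summing: harmonic plane h at bin i holds the sum of the powers at bins i, 2i, ..., h·i. The function names, the thresholds and the synthetic data are all assumptions for illustration.

```python
def harmonic_sum_row(power_row, n_harmonics=8):
    """Return n_harmonics planes for one row; plane h sums the powers at bins i, 2i, ..., h*i."""
    n = len(power_row)
    planes = []
    running = [0.0] * n
    for h in range(1, n_harmonics + 1):
        # ASSUMED stretching: the h-th harmonic of fundamental bin i sits at bin h*i.
        running = [running[i] + (power_row[h * i] if h * i < n else 0.0)
                   for i in range(n)]
        planes.append(running)
    return planes


def find_candidates(planes, thresholds, filter_index):
    """Record (filter, harmonic, bin, amplitude) wherever a plane exceeds its threshold."""
    found = []
    for h, (plane, thr) in enumerate(zip(planes, thresholds), start=1):
        for b, amp in enumerate(plane):
            if amp > thr:
                found.append((filter_index, h, b, amp))
    return found


# Synthetic row: a signal with its fundamental at bin 5 and three more harmonics.
row = [0.0] * 64
for b in (5, 10, 15, 20):
    row[b] = 4.0

planes = harmonic_sum_row(row)
candidates = find_candidates(planes, thresholds=[10.0] * 8, filter_index=0)
print(candidates[0])      # first candidate: (filter, harmonic, bin, amplitude)
print(len(candidates))
```

The computation per element is a single addition and a comparison, while the reads at bins h·i are strided, which matches the next slide's point that memory access rather than arithmetic is the difficulty.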

32 Harmonic-summing
Problems: the input data size is too large (~1.33GBytes); the on-chip memory is too small to hold all planes; the computation itself is very cheap (SPF adds) and easy to parallelise.
Challenge: off-chip memory bandwidth is the issue. Optimise data use (and reuse); computation is not an issue.
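The quoted input size is consistent with simple arithmetic on the task parameters. The sketch below assumes the FOP holds one single-precision power value per filter and per sample (85 rows of 2^22 values), and reads the 50Mb on-chip figure from the platform table as decimal megabits; both readings are assumptions rather than statements from the slides.

```python
N = 2 ** 22          # bins per FOP row (samples per group, from the task table)
M = 85               # filters, i.e. rows of the FOP
H = 8                # harmonic planes
BYTES_SPF = 4        # single-precision float

ON_CHIP_BITS = 50e6  # ASSUMED: "50Mb" means 50 decimal megabits

fop_bytes = N * M * BYTES_SPF
n_thresholds = M * H
on_chip_bytes = ON_CHIP_BITS / 8
ratio = fop_bytes / on_chip_bytes

print(f"FOP size   : {fop_bytes / 2**30:.2f} GiB")
print(f"thresholds : {n_thresholds}")
print(f"FOP / on-chip memory : about {ratio:.0f}x")
```

With the input a couple of hundred times larger than the on-chip RAM, every plane has to be streamed from global memory, so the number of passes over the FOP dominates the run time.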

33 Conclusion
FPGA-based implementation and optimisation of FT convolution (FIR filter), based on the OLA and OLS algorithms.
High-level approaches to such tasks work well: they covered a large design space and allow easy porting and sharing with partners.
With multiple FPGAs, the FPGA implementation has an advantage over the GPU in both performance (GFLOPS) and energy efficiency.

Accelerating the Pulsar Search Pipeline with FPGAs, Programmed in OpenCL

Accelerating the Pulsar Search Pipeline with FPGAs, Programmed in OpenCL Accelerating the Pulsar Search Pipeline with FPGAs, Programmed in OpenCL Oliver Sinnen, Tyrone Sherwin, and Haomiao Wang & Prabu Thiagaraj (Manchester Uni/Raman Research Institute, Bangalore) Parallel

More information

arxiv: v2 [cs.dc] 29 Jun 2018

arxiv: v2 [cs.dc] 29 Jun 2018 Journal of Astronomical Instrumentation c World Scientific Publishing Company Combining Multiple Optimized FPGA-based Pulsar Search Modules Using OpenCL Haomiao Wang, Prabu Thiagaraj, and Oliver Sinnen

More information

THE Square Kilometre Array (SKA) is built to extend our. Harmonic-summing Module of SKA on FPGA Optimising the Irregular Memory Accesses

THE Square Kilometre Array (SKA) is built to extend our. Harmonic-summing Module of SKA on FPGA Optimising the Irregular Memory Accesses 1 Harmonic-summing Module of SKA on FPGA Optimising the Irregular Memory Accesses Haomiao Wang, Prabu Thiagaraj, and Oliver Sinnen arxiv:1805.12258v2 [cs.dc] 29 Jun 2018 Abstract The Square Kilometre Array

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline

More information

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc.

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. CT Image Reconstruction Herman Head Sinogram Herman Head Reconstruction CT Image Reconstruction for all

More information

ATS-GPU Real Time Signal Processing Software

ATS-GPU Real Time Signal Processing Software Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional

More information

Accelerating the acceleration search a case study. By Chris Laidler

Accelerating the acceleration search a case study. By Chris Laidler Accelerating the acceleration search a case study By Chris Laidler Optimization cycle Assess Test Parallelise Optimise Profile Identify the function or functions in which the application is spending most

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Performing Multi-Phased Radar Processing with a Very Deep FPGA Pipeline

Performing Multi-Phased Radar Processing with a Very Deep FPGA Pipeline Performing Multi-Phased Radar Processing with a Very Deep FPGA Pipeline Jeffrey T. Muehring and John K. Antonio School of Computer Science University of Oklahoma antonio@ou.edu 2000 MAPLD Conference The

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Qsys and IP Core Integration

Qsys and IP Core Integration Qsys and IP Core Integration Stephen A. Edwards (after David Lariviere) Columbia University Spring 2016 IP Cores Altera s IP Core Integration Tools Connecting IP Cores IP Cores Cyclone V SoC: A Mix of

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1 Overview radio telescopes six radio telescope algorithms on

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

Machine Learning & Science Data Processing

Machine Learning & Science Data Processing Machine Learning & Science Data Processing Rob Lyon robert.lyon@manchester.ac.uk SKA Group University of Manchester Machine Learning (1) Collective term for branch of A.I. Uses statistical tools to make

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Studying GPU based RTC for TMT NFIRAOS

Studying GPU based RTC for TMT NFIRAOS Studying GPU based RTC for TMT NFIRAOS Lianqi Wang Thirty Meter Telescope Project RTC Workshop Dec 04, 2012 1 Outline Tomography with iterative algorithms on GPUs Matri vector multiply approach Assembling

More information

Interconnection Network for Tightly Coupled Accelerators Architecture

Interconnection Network for Tightly Coupled Accelerators Architecture Interconnection Network for Tightly Coupled Accelerators Architecture Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato Center for Computational Sciences University of Tsukuba, Japan 1 What

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

Using a Scalable Parallel 2D FFT for Image Enhancement

Using a Scalable Parallel 2D FFT for Image Enhancement Introduction Using a Scalable Parallel 2D FFT for Image Enhancement Yaniv Sapir Adapteva, Inc. Email: yaniv@adapteva.com Frequency domain operations on spatial or time data are often used as a means for

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

PoS(10th EVN Symposium)098

PoS(10th EVN Symposium)098 1 Joint Institute for VLBI in Europe P.O. Box 2, 7990 AA Dwingeloo, The Netherlands E-mail: szomoru@jive.nl The, a Joint Research Activity in the RadioNet FP7 2 programme, has as its aim the creation of

More information

OSKAR: Simulating data from the SKA

OSKAR: Simulating data from the SKA OSKAR: Simulating data from the SKA Oxford e-research Centre, 4 June 2014 Fred Dulwich, Ben Mort, Stef Salvini 1 Overview Simulating interferometer data for SKA: Radio interferometry basics. Measurement

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Pactron FPGA Accelerated Computing Solutions

Pactron FPGA Accelerated Computing Solutions Pactron FPGA Accelerated Computing Solutions Intel Xeon + Altera FPGA 2015 Pactron HJPC Corporation 1 Motivation for Accelerators Enhanced Performance: Accelerators compliment CPU cores to meet market

More information

Energy Optimizations for FPGA-based 2-D FFT Architecture

Energy Optimizations for FPGA-based 2-D FFT Architecture Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline

More information

A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST)

A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST) A Study of Data Partitioning on OpenC-based FPGAs Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST) 1 Outline Background and Motivations Data Partitioning on FPGA OpenC on FPGA

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Martin Dubois, ing. Contents

Martin Dubois, ing. Contents Martin Dubois, ing Contents Without OpenNet vs With OpenNet Technical information Possible applications Artificial Intelligence Deep Packet Inspection Image and Video processing Network equipment development

More information

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh Gzip Compression Using Altera OpenCL Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh Gzip Widely-used lossless compression program Gzip = LZ77 + Huffman Big data needs fast

More information

FPGA Acceleration of 3D Component Matching using OpenCL

FPGA Acceleration of 3D Component Matching using OpenCL FPGA Acceleration of 3D Component Introduction 2D component matching, blob extraction or region extraction, is commonly used in computer vision for detecting connected regions that meet pre-determined

More information

Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers

Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers Tracking Acceleration with FPGAs Future Tracking, CMS Week 4/12/17 Sioni Summers Contents Introduction FPGAs & 'DataFlow Engines' for computing Device architecture Maxeler HLT Tracking Acceleration 2 Introduction

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering

A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering HPRCTA 2010 Stefan Craciun Dr. Alan D. George Dr. Herman Lam Dr. Jose C. Principe November 14, 2010 NSF CHREC Center ECE Department,

More information

AMD Embedded PCIe ADD-IN BOARD E6760/E6460 Datasheet. (ER93FLA/ER91FLA-xx)

AMD Embedded PCIe ADD-IN BOARD E6760/E6460 Datasheet. (ER93FLA/ER91FLA-xx) AMD Embedded PCIe ADD-IN BOARD E6760/E6460 Datasheet (ER93FLA/ER91FLA-xx) CONTENTS 1. Feature... 3 2. Functional Overview... 4 2.1. Memory Interface... 4 2.2. Acceleration Features... 4 2.3. Avivo Display

More information

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH Key words: Digital Signal Processing, FIR filters, SIMD processors, AltiVec. Grzegorz KRASZEWSKI Białystok Technical University Department of Electrical Engineering Wiejska

More information

GPU Programming with Ateji PX June 8 th Ateji All rights reserved.

GPU Programming with Ateji PX June 8 th Ateji All rights reserved. GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get

More information

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Computation Strategies & Tricks. Ian Buck NVIDIA GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit

More information

High Performance DoD DSP Applications

High Performance DoD DSP Applications High Performance DoD DSP Applications Robert Bond Embedded Digital Systems Group 23 August 2003 Slide-1 Outline DoD High-Performance DSP Applications Middleware (with some streaming constructs) Future

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

New Zealand Involvement in Solving the SKA Computing Challenges

New Zealand Involvement in Solving the SKA Computing Challenges New Zealand Involvement in Solving the SKA Computing Challenges D R ANDREW E N S O R D I R ECTO R H P C R ESEARC H L A B O R ATORY/ D I R ECTOR N Z SKA ALLIANCE COMPUTING FO R S K A COLLO Q U I UM 2 0

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Field Programmable Gate Array (FPGA) Devices

Field Programmable Gate Array (FPGA) Devices Field Programmable Gate Array (FPGA) Devices 1 Contents Altera FPGAs and CPLDs CPLDs FPGAs with embedded processors ACEX FPGAs Cyclone I,II FPGAs APEX FPGAs Stratix FPGAs Stratix II,III FPGAs Xilinx FPGAs

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology 1 ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology Mikkel B. Stensgaard and Jens Sparsø Technical University of Denmark Technical University of Denmark Outline 2 Motivation ReNoC Basic

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

FPGA Polyphase Filter Bank Study & Implementation

FPGA Polyphase Filter Bank Study & Implementation FPGA Polyphase Filter Bank Study & Implementation Raghu Rao Matthieu Tisserand Mike Severa Prof. John Villasenor Image Communications/. Electrical Engineering Dept. UCLA 1 Introduction This document describes

More information

AMD HD7750 2GB PCIEx16

AMD HD7750 2GB PCIEx16 AMD HD7750 2GB PCIEx16 ADVANTECH MODEL: GFX-AH7750L16-5J MPN number: 1A1-E000130ADP Performance PCIe Graphics 4 x Mini DP CONTENTS 1. Specification... 3 2. Functional Overview... 4 2.1. Memory Interface...

More information

Intel HLS Compiler: Fast Design, Coding, and Hardware

Intel HLS Compiler: Fast Design, Coding, and Hardware white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager

More information

GPUS FOR NGVLA. M Clark, April 2015

GPUS FOR NGVLA. M Clark, April 2015 S FOR NGVLA M Clark, April 2015 GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS MACHINES PC DATA CENTER MOBILE The World Leader in Visual Computing 2 What is a? Tesla K40

More information
